Fonseka K. N. N. - 216036B

ASSIGNMENT 01¶

Customer Churn Dataset¶

01. Dataset Selection¶

A customer churn dataset offers valuable information on customers who have stayed with or left a company during a specific time period. This data often contains client demographics, service usage, account information, and payment history. Customer churn matters because acquiring new customers is frequently more expensive than retaining existing ones. Businesses can use the dataset to build predictive models that identify customers who are likely to leave. This enables businesses, especially in industries like telecommunications where switching providers is easy, to take proactive action against churn through targeted marketing and improved customer service, ultimately enhancing customer retention and profitability.

The dataset is available on Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn.

Importing necessary Libraries¶

In [68]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno
import tensorflow as tf
from keras import models
from keras import layers
%matplotlib inline

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
import lightgbm as lgb

from sklearn import metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix,
                             roc_curve, roc_auc_score)

Loading the Dataset¶

In [5]:
# Load the CSV file into a pandas DataFrame
file_path = r"E:\Lectures\Semester 6\Data Mining\Assignment\New\WA_Fn-UseC_-Telco-Customer-Churn.csv"
dataset = pd.read_csv(file_path)

This customer churn dataset is designed to provide insights into customer behavior within a telecommunications business context.

Business Context: Understanding customer churn is vital for businesses as it impacts revenue and growth. By analyzing this dataset, companies can implement targeted marketing strategies, improve customer satisfaction, and ultimately enhance customer retention.

In [6]:
dataset.shape
Out[6]:
(7043, 21)

The Customer Churn dataset has 7,043 rows and 21 columns (20 features plus the Churn target).

In [7]:
#To check the name of features.
dataset.columns.values
Out[7]:
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)
In [8]:
dataset.head()
Out[8]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

In [9]:
dataset.tail()
Out[9]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes ... Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No ... Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes ... No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No ... No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes ... Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No

5 rows × 21 columns

The first and last five rows of the dataset are shown in the cells above. The dataset contains 7,043 rows and 21 columns, with a mix of numerical and categorical columns; the categorical ones must be converted to a numerical format before training a model. Since the goal is to predict whether a customer churns, this is a binary classification problem.

  1. customerID: A unique identifier assigned to each customer, used to track and differentiate between customers.

  2. gender: The gender of the customer (e.g., male or female).

  3. SeniorCitizen: Indicates whether the customer is a senior citizen (1) or not (0). This is a binary feature.

  4. Partner: Indicates whether the customer has a partner (Yes or No).

  5. Dependents: Indicates whether the customer has dependents (Yes or No).

  6. tenure: The length of time (in months) that the customer has been with the company. A higher value typically indicates customer loyalty.

  7. PhoneService: Indicates whether the customer subscribes to phone service (Yes or No).

  8. MultipleLines: Indicates whether the customer has multiple lines (Yes or No). This feature is relevant for customers who have phone service.

  9. InternetService: Describes the type of internet service the customer subscribes to (e.g., DSL, Fiber optic, or No).

  10. OnlineSecurity: Indicates whether the customer has online security features (Yes or No).

  11. OnlineBackup: Indicates whether the customer subscribes to online backup services (Yes or No).

  12. DeviceProtection: Indicates whether the customer has device protection (Yes or No).

  13. TechSupport: Indicates whether the customer has technical support services (Yes or No).

  14. StreamingTV: Indicates whether the customer has a subscription to streaming TV services (Yes or No).

  15. StreamingMovies: Indicates whether the customer has a subscription to streaming movies services (Yes or No).

  16. Contract: The type of contract the customer has (e.g., Month-to-month, One year, or Two year).

  17. PaperlessBilling: Indicates whether the customer has opted for paperless billing (Yes or No).

  18. PaymentMethod: The method by which the customer pays their bills (e.g., Electronic check, Credit card, etc.).

  19. MonthlyCharges: The monthly fee charged to the customer for services.

  20. TotalCharges: The total amount charged to the customer since they started their service.

  21. Churn: Indicates whether the customer has left the service (Yes) or stayed (No). This is the target variable for predicting customer churn.

It includes a mix of categorical and continuous variables, making it suitable for various analytical tasks.

Key Features:

Categorical Variables: Features such as gender, Partner, Dependents, InternetService, Contract, and Churn provide qualitative information about the customers' demographics, service subscriptions, and retention status.

Continuous Variables: Features like tenure, MonthlyCharges, and TotalCharges are numerical, representing the duration of customer relationships and the financial aspects of their accounts.

Target Variable: The target variable, Churn, indicates whether a customer has left the service (Yes, later encoded as 1) or remained (No, encoded as 0). This binary classification task aims to predict customer churn based on the features provided, enabling the company to identify potential churners and develop strategies for customer retention.

02. Data Preprocessing¶

Data Cleaning¶

In [10]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [11]:
# getting list of object data type column names
object_datatype = []
for x in dataset.dtypes.index:
    if dataset.dtypes[x] == 'O':
        object_datatype.append(x)
print(f"Object Data Type Columns are:\n", object_datatype)

# getting the list of numeric data type column names
number_datatype = []
for x in dataset.dtypes.index:
    if dataset.dtypes[x] == 'float64' or dataset.dtypes[x] == 'int64':
        number_datatype.append(x)
print(f"\nNumber Data Type Columns are:\n", number_datatype)
Object Data Type Columns are:
 ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges', 'Churn']

Number Data Type Columns are:
 ['SeniorCitizen', 'tenure', 'MonthlyCharges']

I have separated the object-dtype column names from the numeric-dtype column names.

As we can see, TotalCharges is categorized as categorical data (it should be numeric) and SeniorCitizen is categorized as numeric data (it should be categorical).
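The same dtype split can be obtained more concisely with pandas' select_dtypes; a minimal sketch on a toy frame standing in for the Telco dataset (column values are hypothetical):

```python
import pandas as pd

# Toy frame: TotalCharges is read in as strings, hence object dtype
df = pd.DataFrame({
    "gender": ["Female", "Male"],
    "SeniorCitizen": [0, 1],
    "tenure": [1, 34],
    "TotalCharges": ["29.85", "1889.5"],
})

# select_dtypes picks columns by dtype in one call
object_cols = df.select_dtypes(include="object").columns.tolist()
number_cols = df.select_dtypes(include="number").columns.tolist()
print(object_cols)  # ['gender', 'TotalCharges']
print(number_cols)  # ['SeniorCitizen', 'tenure']
```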

In [12]:
for col in object_datatype:
    print(col)
    print(dataset[col].value_counts())
    print("="*120)
customerID
customerID
7590-VHVEG    1
3791-LGQCY    1
6008-NAIXK    1
5956-YHHRX    1
5365-LLFYV    1
             ..
9796-MVYXX    1
2637-FKFSY    1
1552-AAGRX    1
4304-TSPVK    1
3186-AJIEK    1
Name: count, Length: 7043, dtype: int64
========================================================================================================================
gender
gender
Male      3555
Female    3488
Name: count, dtype: int64
========================================================================================================================
Partner
Partner
No     3641
Yes    3402
Name: count, dtype: int64
========================================================================================================================
Dependents
Dependents
No     4933
Yes    2110
Name: count, dtype: int64
========================================================================================================================
PhoneService
PhoneService
Yes    6361
No      682
Name: count, dtype: int64
========================================================================================================================
MultipleLines
MultipleLines
No                  3390
Yes                 2971
No phone service     682
Name: count, dtype: int64
========================================================================================================================
InternetService
InternetService
Fiber optic    3096
DSL            2421
No             1526
Name: count, dtype: int64
========================================================================================================================
OnlineSecurity
OnlineSecurity
No                     3498
Yes                    2019
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
OnlineBackup
OnlineBackup
No                     3088
Yes                    2429
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
DeviceProtection
DeviceProtection
No                     3095
Yes                    2422
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
TechSupport
TechSupport
No                     3473
Yes                    2044
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
StreamingTV
StreamingTV
No                     2810
Yes                    2707
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
StreamingMovies
StreamingMovies
No                     2785
Yes                    2732
No internet service    1526
Name: count, dtype: int64
========================================================================================================================
Contract
Contract
Month-to-month    3875
Two year          1695
One year          1473
Name: count, dtype: int64
========================================================================================================================
PaperlessBilling
PaperlessBilling
Yes    4171
No     2872
Name: count, dtype: int64
========================================================================================================================
PaymentMethod
PaymentMethod
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: count, dtype: int64
========================================================================================================================
TotalCharges
TotalCharges
          11
20.2      11
19.75      9
20.05      8
19.9       8
          ..
6849.4     1
692.35     1
130.15     1
3211.9     1
6844.5     1
Name: count, Length: 6531, dtype: int64
========================================================================================================================
Churn
Churn
No     5174
Yes    1869
Name: count, dtype: int64
========================================================================================================================

We can see that the column "TotalCharges" holds float values but is still tagged as object dtype, and 11 rows of that column contain blank data.

In [13]:
dataset['TotalCharges'] = dataset['TotalCharges'].replace(' ' , '0.0')
dataset['TotalCharges'].value_counts()
Out[13]:
count
TotalCharges
0.0 11
20.2 11
19.75 9
20.05 8
19.9 8
... ...
6849.4 1
692.35 1
130.15 1
3211.9 1
6844.5 1

6531 rows × 1 columns


Since there are 11 blank entries in the "TotalCharges" column, I have replaced them with the value 0.0.

The column also showed as object dtype, so I now convert it to float.

In [14]:
dataset['TotalCharges'] = dataset['TotalCharges'].astype('float')
dataset['TotalCharges'].dtype
Out[14]:
dtype('float64')
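The replace-then-cast steps above can also be combined into one call with pd.to_numeric; a minimal sketch on toy values (blanks become NaN, then are filled with 0.0):

```python
import pandas as pd

# Minimal stand-in for the TotalCharges column, including a blank entry
s = pd.Series(["29.85", "1889.5", " "])

# Coerce non-numeric entries (the blanks) to NaN, then fill with 0.0
charges = pd.to_numeric(s, errors="coerce").fillna(0.0)
print(charges.dtype)     # float64
print(charges.tolist())  # [29.85, 1889.5, 0.0]
```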

Similarly I am converting the column "SeniorCitizen" from numeric data type to object datatype as it contains categorical information and it will be easier to process it like a category with the other.

In [15]:
dataset['SeniorCitizen'] = dataset['SeniorCitizen'].astype('O')
dataset['SeniorCitizen'].dtype
Out[15]:
dtype('O')
In [16]:
# getting list of object data type column names
object_datatype = []
for x in dataset.dtypes.index:
    if dataset.dtypes[x] == 'O':
        object_datatype.append(x)
print(f"Object Data Type Columns are:\n", object_datatype)

# getting the list of numeric data type column names
number_datatype = []
for x in dataset.dtypes.index:
    if dataset.dtypes[x] == 'float64' or dataset.dtypes[x] == 'int64':
        number_datatype.append(x)
print(f"\nNumber Data Type Columns are:\n", number_datatype)
Object Data Type Columns are:
 ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']

Number Data Type Columns are:
 ['tenure', 'MonthlyCharges', 'TotalCharges']

Now I have successfully made sure that the object dtype covers all the categorical columns while the numeric dtype holds the continuous columns.

In [17]:
# 1. Handling Missing Values
dataset.isnull().sum()
Out[17]:
0
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0

In [18]:
missingno.bar(dataset, figsize = (25,5), color="tab:green")
Out[18]:
<Axes: >

The dataset was checked for missing values using the isnull().sum() function, which revealed that there were no missing entries in any of the columns.

In [19]:
# 2. Check for Outliers using Boxplots
# Convert TotalCharges to numeric, coerce errors to NaN
dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'], errors='coerce')

# List of continuous columns
continuous_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Plotting boxplots
plt.figure(figsize=(15, 8))
for i, column in enumerate(continuous_columns, 1):
    plt.subplot(1, 3, i)
    sns.boxplot(y=dataset[column])
    plt.title(f'Boxplot of {column}')
plt.tight_layout()
plt.show()

I have checked for outliers in numerical data using box plots. We can see that there aren't any outliers in the dataset.
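The visual box-plot check can be complemented with a numeric rule; a minimal sketch of Tukey's 1.5 × IQR criterion on toy values (hypothetical, not from the dataset; the last entry is a deliberate outlier):

```python
import pandas as pd

# Toy numeric column with one extreme value
s = pd.Series([10.0, 12.0, 11.5, 13.0, 12.5, 100.0])

# Tukey's rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [100.0]
```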

In [20]:
#3. Check for Erroneous Data
# Check for Erroneous Data
print("\nChecking for Erroneous Data:")

# 1. Check tenure with negative values
print("\nTenure (months) less than 0:")
print(dataset[dataset['tenure'] < 0])

# 2. Check for MonthlyCharges or TotalCharges less than 0
print("\nMonthlyCharges or TotalCharges less than 0:")
print(dataset[(dataset['MonthlyCharges'] < 0) | (dataset['TotalCharges'] < 0)])
Checking for Erroneous Data:

Tenure (months) less than 0:
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]

MonthlyCharges or TotalCharges less than 0:
Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]

I have also checked for Erroneous data.

1. Tenure Values Check:

The dataset was checked for any customers with a tenure of less than 0 months.

Results: No entries were found with tenure values less than 0, indicating that all customers have valid subscription durations.

2. Billing Amounts Check:

Both MonthlyCharges and TotalCharges were evaluated to identify any negative values.

Results: There were no entries with negative values for either MonthlyCharges or TotalCharges, confirming the accuracy and validity of the billing data.

We can see that there is no erroneous data.

In [21]:
# 4. Check for Duplicates
print("\nChecking for Duplicate Entries:")
duplicates = dataset.duplicated().sum()
print(f"Number of duplicate entries: {duplicates}")
Checking for Duplicate Entries:
Number of duplicate entries: 0

We can see that there aren't any duplicates.

In [22]:
# Remove the customerID column
dataset = dataset.drop(columns=['customerID'])
In [23]:
dataset.head()
Out[23]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Feature Engineering¶

In [24]:
# 1. Ordinal encoding for categorical features
def ordinal_encode(dataset, columns):
    oe = OrdinalEncoder()
    # Encode the selected columns in place; each category maps to an integer code
    dataset[columns] = oe.fit_transform(dataset[columns])
    return dataset, oe

# Function to display mappings of original categories to encoded values
def display_mappings(oe, columns):
    for idx, col in enumerate(columns):
        print(f"Mapping for {col}:")
        # Fetch the original categories and their corresponding encoded values
        for original_value, encoded_value in zip(oe.categories_[idx], range(len(oe.categories_[idx]))):
            print(f"  {encoded_value}.0 = {original_value}")
        print("\n")

# Categorical columns to encode
categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'SeniorCitizen',
                       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

#  Perform ordinal encoding
dataset, oe = ordinal_encode(dataset, categorical_columns)

I applied ordinal encoding to the categorical features in the dataset. This transforms categorical variables into numerical codes that machine learning algorithms can work with. Specifically, I used scikit-learn's OrdinalEncoder, which maps each category to an integer.

The columns selected for encoding are gender, Partner, Dependents, PhoneService, MultipleLines, SeniorCitizen, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, and PaymentMethod. Each of these contains non-numeric values that must be converted to a numerical format before model training. One caveat: ordinal encoding imposes an artificial order on nominal categories (e.g. DSL < Fiber optic < No). One-hot encoding avoids this at the cost of extra columns, and tree-based models are largely insensitive to the distinction.
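For comparison, a true one-hot representation (one binary indicator column per category, with no implied order) can be sketched with pandas' get_dummies on a toy column (hypothetical values):

```python
import pandas as pd

# Toy column for a nominal feature
df = pd.DataFrame({"InternetService": ["DSL", "Fiber optic", "No", "DSL"]})

# One binary 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["InternetService"], dtype=int)
print(encoded.columns.tolist())
# ['InternetService_DSL', 'InternetService_Fiber optic', 'InternetService_No']
```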

In [25]:
# Display mappings for each categorical variable
display_mappings(oe, categorical_columns)
Mapping for gender:
  0.0 = Female
  1.0 = Male


Mapping for Partner:
  0.0 = No
  1.0 = Yes


Mapping for Dependents:
  0.0 = No
  1.0 = Yes


Mapping for PhoneService:
  0.0 = No
  1.0 = Yes


Mapping for MultipleLines:
  0.0 = No
  1.0 = No phone service
  2.0 = Yes


Mapping for SeniorCitizen:
  0.0 = 0
  1.0 = 1


Mapping for InternetService:
  0.0 = DSL
  1.0 = Fiber optic
  2.0 = No


Mapping for OnlineSecurity:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for OnlineBackup:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for DeviceProtection:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for TechSupport:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for StreamingTV:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for StreamingMovies:
  0.0 = No
  1.0 = No internet service
  2.0 = Yes


Mapping for Contract:
  0.0 = Month-to-month
  1.0 = One year
  2.0 = Two year


Mapping for PaperlessBilling:
  0.0 = No
  1.0 = Yes


Mapping for PaymentMethod:
  0.0 = Bank transfer (automatic)
  1.0 = Credit card (automatic)
  2.0 = Electronic check
  3.0 = Mailed check


The output shows the mappings of the categorical variables after using OrdinalEncoder to encode the values.

In [26]:
dataset.head()
Out[26]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0.0 0.0 1.0 0.0 1 0.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 29.85 29.85 No
1 1.0 0.0 0.0 0.0 34 1.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0 3.0 56.95 1889.50 No
2 1.0 0.0 0.0 0.0 2 1.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 53.85 108.15 Yes
3 1.0 0.0 0.0 0.0 45 0.0 1.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 1.0 0.0 0.0 42.30 1840.75 No
4 0.0 0.0 0.0 0.0 2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 70.70 151.65 Yes
In [27]:
# 2. Normalizing/Scaling numeric features
numerical_columns = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Use MinMaxScaler
scaler = MinMaxScaler()
dataset[numerical_columns] = scaler.fit_transform(dataset[numerical_columns])

Here I normalized the numeric features in the dataset, specifically tenure, MonthlyCharges, and TotalCharges, using MinMaxScaler, which rescales each feature to the range [0, 1] by default.

Normalization is critical when working with algorithms that are sensitive to the scale of the input data. By scaling the numeric features, I ensure that each feature contributes equally to the distance calculations made by the model, preventing features with larger ranges from dominating the learning process. This step enhances the model's performance and convergence speed, leading to more accurate predictions.
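The transformation MinMaxScaler applies is simply (x - min) / (max - min) per feature; a minimal sketch of that formula on toy tenure-like values (hypothetical):

```python
import numpy as np

# Min-max scaling maps each value x to (x - min) / (max - min)
x = np.array([0.0, 18.0, 72.0])
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled)  # [0.   0.25 1.  ]
```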

In [28]:
dataset.head()
Out[28]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0.0 0.0 1.0 0.0 0.013889 0.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 0.115423 0.003437 No
1 1.0 0.0 0.0 0.0 0.472222 1.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0 3.0 0.385075 0.217564 No
2 1.0 0.0 0.0 0.0 0.027778 1.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.354229 0.012453 Yes
3 1.0 0.0 0.0 0.0 0.625000 0.0 1.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 1.0 0.0 0.0 0.239303 0.211951 No
4 0.0 0.0 0.0 0.0 0.027778 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 0.521891 0.017462 Yes
In [29]:
# 3. Encoding target variable 'Churn'
dataset['Churn'] = dataset['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

Finally, I transformed the target variable Churn from categorical to numerical format by applying a lambda function. This function assigns a value of 1 to entries marked as 'Yes' (indicating churn) and a value of 0 to entries marked as 'No' (indicating no churn).

Converting the target variable into a numerical format is essential for classification tasks, where the model needs to predict binary outcomes. By using 0 and 1 to represent the two classes, I ensure that the machine learning algorithm can effectively process the target variable, allowing it to learn patterns associated with customer churn during the training phase.
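A vectorized alternative to the per-row lambda is a boolean comparison cast to int; a minimal sketch on toy Churn values (as they appear in the raw dataset):

```python
import pandas as pd

# Toy Churn column
churn = pd.Series(["No", "No", "Yes", "No", "Yes"])

# Boolean mask cast to int: True -> 1, False -> 0
encoded = (churn == "Yes").astype(int)
print(encoded.tolist())  # [0, 0, 1, 0, 1]
```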

In [30]:
# Display the transformed dataset
print(dataset.head())
   gender  SeniorCitizen  Partner  Dependents    tenure  PhoneService  \
0     0.0            0.0      1.0         0.0  0.013889           0.0   
1     1.0            0.0      0.0         0.0  0.472222           1.0   
2     1.0            0.0      0.0         0.0  0.027778           1.0   
3     1.0            0.0      0.0         0.0  0.625000           0.0   
4     0.0            0.0      0.0         0.0  0.027778           1.0   

   MultipleLines  InternetService  OnlineSecurity  OnlineBackup  \
0            1.0              0.0             0.0           2.0   
1            0.0              0.0             2.0           0.0   
2            0.0              0.0             2.0           2.0   
3            1.0              0.0             2.0           0.0   
4            0.0              1.0             0.0           0.0   

   DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  \
0               0.0          0.0          0.0              0.0       0.0   
1               2.0          0.0          0.0              0.0       1.0   
2               0.0          0.0          0.0              0.0       0.0   
3               2.0          2.0          0.0              0.0       1.0   
4               0.0          0.0          0.0              0.0       0.0   

   PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges  Churn  
0               1.0            2.0        0.115423      0.003437      0  
1               0.0            3.0        0.385075      0.217564      0  
2               1.0            3.0        0.354229      0.012453      1  
3               0.0            0.0        0.239303      0.211951      0  
4               1.0            2.0        0.521891      0.017462      1  

We can see that the dataset has been encoded and scaled/normalized.

In [31]:
print(f"Shape of our data frame post encoding is", dataset.shape)
Shape of our data frame post encoding is (7043, 20)
In [32]:
# Checking the data types of all the columns
dataset.dtypes
Out[32]:
0
gender float64
SeniorCitizen float64
Partner float64
Dependents float64
tenure float64
PhoneService float64
MultipleLines float64
InternetService float64
OnlineSecurity float64
OnlineBackup float64
DeviceProtection float64
TechSupport float64
StreamingTV float64
StreamingMovies float64
Contract float64
PaperlessBilling float64
PaymentMethod float64
MonthlyCharges float64
TotalCharges float64
Churn int64

We can see that all the features are now numerical.

Handling Skewness¶

In [33]:
# Checking Skewness
dataset.skew()
Out[33]:
0
gender -0.019031
SeniorCitizen 1.833633
Partner 0.067922
Dependents 0.875199
tenure 0.239540
PhoneService -2.727153
MultipleLines 0.118719
InternetService 0.205423
OnlineSecurity 0.416985
OnlineBackup 0.182930
DeviceProtection 0.186847
TechSupport 0.402365
StreamingTV 0.028486
StreamingMovies 0.014657
Contract 0.630959
PaperlessBilling -0.375396
PaymentMethod -0.170129
MonthlyCharges -0.220524
TotalCharges 0.963235
Churn 1.063031

With the skew method we see that several columns fall outside the commonly used acceptable range of ±0.5. However, most of those are categorical columns, and skewness or outliers are not meaningful for categorical data, so we can safely ignore them. I have treated only the skewness present in our continuous data columns.
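The idea of restricting the skew check to the continuous columns can be sketched as follows. The column names mirror the notebook's dataset, but the values here are synthetic, for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "tenure": rng.integers(0, 72, 1000).astype(float),
    "MonthlyCharges": rng.uniform(18, 120, 1000),
    "TotalCharges": rng.exponential(2000, 1000),          # right-skewed, like the real column
    "SeniorCitizen": rng.integers(0, 2, 1000).astype(float),  # categorical: excluded from the check
})

# Only the truly continuous columns are candidates for skew treatment
continuous = ["tenure", "MonthlyCharges", "TotalCharges"]
needs_fix = df[continuous].skew().loc[lambda s: s.abs() > 0.5]
print(needs_fix)
```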

In [34]:
# Identify numeric columns
number_datatype = dataset.select_dtypes(include=['float64', 'int64']).columns

# Create subplots based on the number of numeric columns
n_cols = 3  # Define the number of columns for the subplot layout
n_rows = (len(number_datatype) + n_cols - 1) // n_cols  # Calculate number of rows needed
fig, ax = plt.subplots(ncols=n_cols, nrows=n_rows, figsize=(15, 4 * n_rows))
ax = ax.flatten()  # Flatten the axes array for easy indexing

# Plot the distribution for each numeric column
for index, col in enumerate(number_datatype):
    sns.histplot(dataset[col], ax=ax[index], kde=True, color="y", stat="density", bins=30)
    ax[index].set_title(f'Distribution of {col}')
    ax[index].set_xlabel(col)
    ax[index].set_ylabel('Density')

# Hide any unused subplots
for i in range(len(number_datatype), len(ax)):
    fig.delaxes(ax[i])

plt.tight_layout(pad=0.4, w_pad=0.4, h_pad=1.0)
plt.show()

In the distribution plots above we can see that our continuous data columns have some skewness that needs to be treated and reduced to an acceptable range.

In [35]:
# Identify numeric columns
number_datatype = dataset.select_dtypes(include=['float64', 'int64']).columns

# Handle skewness, excluding the 'Churn' and 'PhoneService' columns
for col in number_datatype:
    if col != 'Churn' and col != 'PhoneService':  # Exclude the target and the binary 'PhoneService' column
        # Check for infinite values and replace them with NaN
        dataset[col] = dataset[col].replace([np.inf, -np.inf], np.nan)

        # Optionally cap extreme values
        threshold = 1e10  # Set a reasonable threshold
        dataset[col] = np.where(dataset[col] > threshold, threshold, dataset[col])

        # Check for NaN values after replacement
        if dataset[col].isnull().any():
            print(f"NaN values found in column '{col}' after replacing infinities.")

        # Now check for skewness and apply transformations
        if dataset[col].skew() > 0.55:  # Check for positive skewness
            dataset[col] = np.log1p(dataset[col])  # Apply log transformation
        elif dataset[col].skew() < -0.55:  # Check for negative skewness
        dataset[col] = -np.log1p(-dataset[col])  # Reflected log transformation for negative skewness

# Check for skewness again and apply square root transformation if necessary
for col in number_datatype:
    if col != 'Churn' and col != 'PhoneService' and dataset[col].skew() > 0.55:
        dataset[col] = np.sqrt(dataset[col])  # Square root transformation

In the above code I preprocess the numeric columns to handle skewness and ensure data quality, excluding the target variable "Churn" and the binary "PhoneService" column. The process begins by identifying numeric columns of type float64 and int64. For each remaining column, the code replaces infinite values (both positive and negative) with NaN to avoid complications in further analysis, and caps extreme values at a defined threshold (1e10) to prevent outliers from disproportionately affecting the analysis. After addressing potential infinite values, the code checks each column's skewness: if it exceeds 0.55, a log transformation (log1p) is applied to reduce positive skewness, while a reflected log transformation is used for negative skewness. Finally, it re-evaluates the skewness of the columns, applying a square-root transformation to any column that still exhibits skewness greater than 0.55. This approach aims to normalize the distribution of the numeric data, facilitating more accurate modeling and analysis.
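The effect of the log transformation can be demonstrated on synthetic data (the values below are illustrative, not from the dataset): `log1p` pulls in the long right tail of a positively skewed variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
raw = pd.Series(rng.exponential(scale=2.0, size=5000))  # strongly right-skewed
print(f"skew before: {raw.skew():.2f}")

transformed = np.log1p(raw)  # log(1 + x), safe at x = 0
print(f"skew after:  {transformed.skew():.2f}")  # much closer to symmetric
```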

In [36]:
# Print skewness after transformation
print("Skewness after transformation:\n", dataset[number_datatype].skew())
Skewness after transformation:
 gender             -0.019031
SeniorCitizen       1.833633
Partner             0.067922
Dependents          0.875199
tenure              0.239540
PhoneService       -2.727153
MultipleLines       0.118719
InternetService     0.205423
OnlineSecurity      0.416985
OnlineBackup        0.182930
DeviceProtection    0.186847
TechSupport         0.402365
StreamingTV         0.028486
StreamingMovies     0.014657
Contract            0.434281
PaperlessBilling   -0.375396
PaymentMethod      -0.170129
MonthlyCharges     -0.220524
TotalCharges        0.131658
Churn               1.063031
dtype: float64
In [37]:
# Identify numeric columns
number_datatype = dataset.select_dtypes(include=['float64', 'int64']).columns

# Create subplots based on the number of numeric columns
n_cols = 3  # Define the number of columns for the subplot layout
n_rows = (len(number_datatype) + n_cols - 1) // n_cols  # Calculate number of rows needed
fig, ax = plt.subplots(ncols=n_cols, nrows=n_rows, figsize=(15, 4 * n_rows))
ax = ax.flatten()  # Flatten the axes array for easy indexing

# Plot the distribution for each numeric column
for index, col in enumerate(number_datatype):
    sns.histplot(dataset[col], ax=ax[index], kde=True, color="y", stat="density", bins=30)
    ax[index].set_title(f'Distribution of {col}')
    ax[index].set_xlabel(col)
    ax[index].set_ylabel('Density')

# Hide any unused subplots
for i in range(len(number_datatype), len(ax)):
    fig.delaxes(ax[i])

plt.tight_layout(pad=0.4, w_pad=0.4, h_pad=1.0)
plt.show()

We can see that the Skewness has been reduced.

In [38]:
dataset.head()
Out[38]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0.0 0.0 1.0 0.0 0.013889 0.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.000000 1.0 2.0 0.115423 0.058576 0
1 1.0 0.0 0.0 0.0 0.472222 1.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 0.693147 0.0 3.0 0.385075 0.443680 0
2 1.0 0.0 0.0 0.0 0.027778 1.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0 0.000000 1.0 3.0 0.354229 0.111247 1
3 1.0 0.0 0.0 0.0 0.625000 0.0 1.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 0.693147 0.0 0.0 0.239303 0.438442 0
4 0.0 0.0 0.0 0.0 0.027778 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 2.0 0.521891 0.131571 1

Here, we can see the preprocessed dataset.

03. Exploratory Data Analysis (EDA)¶

Descriptive Statistics¶

In [39]:
dataset.describe()
Out[39]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000
mean 0.504756 0.134996 0.483033 0.249424 0.449599 0.903166 0.940508 0.872923 0.790004 0.906432 0.904444 0.797104 0.985376 0.992475 0.409364 0.592219 1.574329 0.462803 0.404014 0.265370
std 0.500013 0.306889 0.499748 0.381402 0.341104 0.295752 0.948554 0.737796 0.859848 0.880162 0.879949 0.861551 0.885002 0.885091 0.472658 0.491457 1.068104 0.299403 0.224392 0.441561
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.125000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.171642 0.211822 0.000000
50% 1.000000 0.000000 0.000000 0.000000 0.402778 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 2.000000 0.518408 0.385894 0.000000
75% 1.000000 0.000000 1.000000 0.832555 0.763889 1.000000 2.000000 1.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 0.693147 1.000000 2.000000 0.712438 0.601551 1.000000
max 1.000000 0.832555 1.000000 0.832555 1.000000 1.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 1.098612 1.000000 3.000000 1.000000 0.832555 1.000000

The provided summary statistics present key metrics for various features in a dataset containing 7,043 observations related to customer characteristics and their churn behavior.

The statistics indicate that the mean value for gender (0.5) suggests an approximately even distribution between male and female customers.

Notably, the SeniorCitizen feature shows a mean of about 0.135, indicating that a small proportion of customers are senior citizens.

The Partner and Dependents columns reveal that about 48.3% and 24.9% of customers have partners and dependents, respectively.

The tenure averages around 0.45 on the normalized scale (roughly 32 months of a 72-month maximum).

The Churn column indicates the binary nature of customer retention, with a mean value of about 0.265, reflecting a churn rate of around 26.5%.

Overall, these statistics provide a foundational understanding of the dataset's composition, highlighting areas for further exploration and potential preprocessing.
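The churn rate can be read directly off the mean in `describe()` because, for a 0/1 column, the mean is simply the share of ones. A toy illustration (values are made up):

```python
import pandas as pd

# For a binary column, mean = (# of ones) / (# of rows)
churn = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])
print(churn.mean())  # 2 ones out of 8 rows -> 0.25
```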

In [40]:
# Calculate statistics for each column in the dataset
mean_values = dataset.mean()
median_values = dataset.median()
variance_values = dataset.var()
std_dev_values = dataset.std()
skewness_values = dataset.skew()
kurtosis_values = dataset.kurt()

# Combine all statistics into a single DataFrame for better visualization
extended_stats = pd.DataFrame({
    'Mean': mean_values,
    'Median': median_values,
    'Variance': variance_values,
    'Standard Deviation': std_dev_values,
    'Skewness': skewness_values,
    'Kurtosis': kurtosis_values
})

# Display the extended statistics
print("Extended Descriptive Statistics:")
print(extended_stats)
Extended Descriptive Statistics:
                      Mean    Median  Variance  Standard Deviation  Skewness  \
gender            0.504756  1.000000  0.250013            0.500013 -0.019031   
SeniorCitizen     0.134996  0.000000  0.094181            0.306889  1.833633   
Partner           0.483033  0.000000  0.249748            0.499748  0.067922   
Dependents        0.249424  0.000000  0.145467            0.381402  0.875199   
tenure            0.449599  0.402778  0.116352            0.341104  0.239540   
PhoneService      0.903166  1.000000  0.087469            0.295752 -2.727153   
MultipleLines     0.940508  1.000000  0.899755            0.948554  0.118719   
InternetService   0.872923  1.000000  0.544343            0.737796  0.205423   
OnlineSecurity    0.790004  1.000000  0.739338            0.859848  0.416985   
OnlineBackup      0.906432  1.000000  0.774686            0.880162  0.182930   
DeviceProtection  0.904444  1.000000  0.774310            0.879949  0.186847   
TechSupport       0.797104  1.000000  0.742269            0.861551  0.402365   
StreamingTV       0.985376  1.000000  0.783228            0.885002  0.028486   
StreamingMovies   0.992475  1.000000  0.783386            0.885091  0.014657   
Contract          0.409364  0.000000  0.223406            0.472658  0.434281   
PaperlessBilling  0.592219  1.000000  0.241530            0.491457 -0.375396   
PaymentMethod     1.574329  2.000000  1.140846            1.068104 -0.170129   
MonthlyCharges    0.462803  0.518408  0.089642            0.299403 -0.220524   
TotalCharges      0.404014  0.385894  0.050352            0.224392  0.131658   
Churn             0.265370  0.000000  0.194976            0.441561  1.063031   

                  Kurtosis  
gender           -2.000206  
SeniorCitizen     1.362596  
Partner          -1.995953  
Dependents       -1.234378  
tenure           -1.387372  
PhoneService      5.438908  
MultipleLines    -1.878378  
InternetService  -1.145505  
OnlineSecurity   -1.520966  
OnlineBackup     -1.684892  
DeviceProtection -1.683207  
TechSupport      -1.535001  
StreamingTV      -1.722830  
StreamingMovies  -1.723523  
Contract         -1.574872  
PaperlessBilling -1.859606  
PaymentMethod    -1.211766  
MonthlyCharges   -1.257260  
TotalCharges     -1.174449  
Churn            -0.870211  

The extended descriptive statistics provide a comprehensive overview of the dataset's key metrics across various features.

The mean values indicate that a majority of the categorical variables, such as gender (approximately 50.5% male), SeniorCitizen (approximately 13.5% of individuals are senior citizens), and Partner status (approximately 48.3% have a partner), exhibit a balanced distribution. The tenure variable has a mean of approximately 0.45, suggesting a relatively low average tenure among customers.

In terms of variance and standard deviation, most features display moderate levels of dispersion.

The skewness values indicate some degree of asymmetry, particularly for SeniorCitizen (1.83), which suggests a rightward skew, while features like gender and PaperlessBilling are relatively symmetric.

The kurtosis values reveal that most features exhibit platykurtic distributions, characterized by lighter tails than a normal distribution, which may indicate a tendency toward more moderate outliers. These statistics are crucial for understanding the dataset and identifying potential issues or patterns that may inform subsequent analysis and modeling efforts.
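Note that pandas `kurt()` reports *excess* kurtosis (Fisher's definition), so 0 is the normal-distribution baseline and negative values indicate the platykurtic shapes described above. A quick check on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
normal = pd.Series(rng.normal(size=100_000))
uniform = pd.Series(rng.uniform(size=100_000))

print(f"normal:  {normal.kurt():.2f}")   # close to 0 for a normal distribution
print(f"uniform: {uniform.kurt():.2f}")  # close to -1.2: flatter tails, platykurtic
```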

Visualization¶

Bar Charts

In [41]:
# Create a figure for the plots
plt.figure(figsize=(16, 12))

# Count plot for churn
plt.subplot(3, 2, 1)  # 3 rows, 2 columns, first subplot
sns.countplot(data=dataset, x='Churn', palette='pastel')
plt.title('Count Plot for Churn', fontsize=16)
total = len(dataset)
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height() / total * 100:.2f}%',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom', fontsize=12)

# Count plot for gender
plt.subplot(3, 2, 2)  # 3 rows, 2 columns, second subplot
sns.countplot(data=dataset, x='gender', palette='pastel')
plt.title('Count Plot for Gender', fontsize=16)
total = len(dataset)
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height() / total * 100:.2f}%',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom', fontsize=12)

# Count plot for Senior Citizen
plt.subplot(3, 2, 3)  # 3 rows, 2 columns, third subplot
sns.countplot(data=dataset, x='SeniorCitizen', palette='pastel')
plt.title('Count Plot for Senior Citizen', fontsize=16)
total = len(dataset)
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height() / total * 100:.2f}%',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom', fontsize=12)

# Count plot for Partner
plt.subplot(3, 2, 4)  # 3 rows, 2 columns, fourth subplot
sns.countplot(data=dataset, x='Partner', palette='pastel')
plt.title('Count Plot for Partner', fontsize=16)
total = len(dataset)
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height() / total * 100:.2f}%',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom', fontsize=12)

# Count plot for Dependents
plt.subplot(3, 2, 5)  # 3 rows, 2 columns, fifth subplot
sns.countplot(data=dataset, x='Dependents', palette='pastel')
plt.title('Count Plot for Dependents', fontsize=16)
total = len(dataset)
for p in plt.gca().patches:
    plt.annotate(f'{p.get_height() / total * 100:.2f}%',
                 (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom', fontsize=12)

# Adjust layout and show the plots
plt.tight_layout()
plt.show()
<ipython-input-41-15478a047323>:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=dataset, x='Churn', palette='pastel')
<ipython-input-41-15478a047323>:16: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=dataset, x='gender', palette='pastel')
<ipython-input-41-15478a047323>:26: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=dataset, x='SeniorCitizen', palette='pastel')
<ipython-input-41-15478a047323>:36: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=dataset, x='Partner', palette='pastel')
<ipython-input-41-15478a047323>:46: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=dataset, x='Dependents', palette='pastel')

In the count plot for churn above, the "No churn" class far outnumbers the "Yes churn" class. Since this is our target label, the data is imbalanced, which will need to be rectified later on.
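The imbalance can be quantified with `value_counts(normalize=True)`. The toy label below reproduces roughly the same 73/27 split as our Churn column:

```python
import pandas as pd

churn = pd.Series([0] * 73 + [1] * 27)
ratio = churn.value_counts(normalize=True)
print(ratio)                # 0 -> 0.73, 1 -> 0.27
print(ratio[0] / ratio[1])  # the majority class is ~2.7x the minority class
```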

In the count plot for gender above, 0 indicates female and 1 indicates male. The two counts are almost equal, with males slightly outnumbering females.

In the count plot for senior citizens above, 0 indicates a non-senior customer and 1 a senior citizen. The small number of 1s shows that non-senior customers far outnumber senior citizens.

In the count plot for partner above, the two groups are roughly balanced, with customers who have no partner slightly outnumbering those who do.

In the count plot for dependents above, customers with dependents are far fewer than those without.

In [42]:
# Set the style for the plots
plt.style.use('seaborn-dark-palette')

# List of columns to plot
columns_to_plot = ['gender', 'SeniorCitizen', 'Partner', 'Dependents']

# Create subplots
plt.figure(figsize=(16, 12))

for i, col in enumerate(columns_to_plot):
    plt.subplot(2, 2, i + 1)  # Create a grid of 2x2 for subplots
    sns.countplot(x=col, hue='Churn', data=dataset, palette='pastel', edgecolor='black')

    # Add title and labels
    plt.title(f'Count Plot for {col.capitalize()}', fontsize=16)
    plt.xlabel(col.capitalize(), fontsize=14)
    plt.ylabel('Count', fontsize=14)

# Adjust layout and show the plots
plt.tight_layout()
plt.show()
<ipython-input-42-bc3369fc894b>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-dark-palette')

In the plots above we can see how churn is distributed across gender, senior citizen, partner, and dependents.

Pie Chart

In [43]:
# Function to plot pie charts for categorical variables
def plot_pie_chart(column_name, title):
    plt.figure(figsize=(8, 8))
    dataset[column_name].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, cmap='Set2', explode=[0.1] * len(dataset[column_name].value_counts()))
    plt.title(title, fontsize=16)
    plt.ylabel('')  # Remove y-label for clarity
    plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
    plt.show()

# Pie charts for specified categorical variables
plot_pie_chart('gender', 'Distribution of Gender')
plot_pie_chart('SeniorCitizen', 'Distribution of Senior Citizen')
plot_pie_chart('Partner', 'Distribution of Partner')
plot_pie_chart('Dependents', 'Distribution of Dependents')

We have done the same with pie charts.

Box Plot

In [44]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Contract', y='MonthlyCharges', data=dataset, palette='coolwarm')
plt.title('Box Plot of Monthly Charges by Contract Type')
plt.show()
<ipython-input-44-395532b2d423>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='Contract', y='MonthlyCharges', data=dataset, palette='coolwarm')

The box plot shows the relationship between different contract types and the MonthlyCharges paid by customers. The contract types are mapped as follows:

0.0 = Month-to-month

1.0 = One year

2.0 = Two year

Observations:

Month-to-month (0.0):

The median MonthlyCharges for customers with a month-to-month contract is slightly below 0.6 on a normalized scale (which might correspond to a higher charge range). The distribution is fairly tight, with most charges falling between 0.2 and 0.8. There is a wider spread in the lower charges compared to the higher end, but the overall range of charges is lower than for longer contracts.

One year (1.0):

The median charge is around 0.55, lower than the month-to-month contract. The spread of the charges is larger, with customers falling within a wider range of monthly charges. The distribution shows a somewhat even spread, indicating that customers with this contract may pay a wide variety of monthly fees, ranging from very low to high amounts.

Two year (2.0):

Customers with a two-year contract have a lower median MonthlyCharges, close to the 0.5 mark.

In [45]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='InternetService', y='TotalCharges', data=dataset, palette='coolwarm')
plt.title('Box Plot of Total Charges by Internet Service')
plt.show()
<ipython-input-45-5aa0065f0b89>:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='InternetService', y='TotalCharges', data=dataset, palette='coolwarm')

The box plot shows the relationship between the different Internet service types and the total charges paid by customers. The service types are mapped as follows:

0.0 = DSL

1.0 = Fiber Optic

2.0 = No

Customers with no internet service have the lowest median total charges, while fiber-optic customers have the highest median (around 0.5 on the normalized scale) and the widest spread of the three groups.

Scatter Plot

In [46]:
# Relationship Between Tenure and Monthly Charges
plt.style.use('seaborn-bright')

# relplot creates its own figure, so a separate plt.figure() call is not
# needed (it would only leave an empty figure behind)
sns.relplot(data=dataset, x='tenure', y='MonthlyCharges', hue='Churn', height=6)
sns.rugplot(data=dataset, x='tenure', hue='Churn', legend=False)

# Set titles and labels
plt.title('Relationship Between Tenure and Monthly Charges', fontsize=16)
plt.xlabel('Tenure (Months)', fontsize=14)
plt.ylabel('Monthly Charges ($)', fontsize=14)
plt.show()
<ipython-input-46-593546300f72>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-bright')

The plot above shows that customers in the initial months of service are more prone to discontinue if they are unhappy with the service offered to them, so strong retention efforts are required in that phase.

In [47]:
# Relationship Between Tenure and Total Charges
plt.style.use('seaborn-bright')

# relplot creates its own figure, so a separate plt.figure() call is not
# needed (it would only leave an empty figure behind)
sns.relplot(data=dataset, x='tenure', y='TotalCharges', hue='Churn', height=6)
sns.rugplot(data=dataset, x='tenure', hue='Churn', legend=False)

# Set titles and labels
plt.title('Relationship Between Tenure and Total Charges', fontsize=16)
plt.xlabel('Tenure (Months)', fontsize=14)
plt.ylabel('Total Charges ($)', fontsize=14)
plt.show()
<ipython-input-47-c7284bc967d7>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-bright')

The plot above shows that total charges increase with tenure, and customers who have stayed for a long tenure rarely churn.

In [48]:
# Relationship Between Total Charges and Monthly Charges
plt.style.use('seaborn-bright')

# relplot creates its own figure, so a separate plt.figure() call is not
# needed (it would only leave an empty figure behind)
sns.relplot(data=dataset, x='TotalCharges', y='MonthlyCharges', hue='Churn', height=6)
sns.rugplot(data=dataset, x='TotalCharges', y='MonthlyCharges', hue='Churn', legend=False)

# Set titles and labels
plt.title('Relationship Between Total Charges and Monthly Charges', fontsize=16)
plt.xlabel('Total Charges ($)', fontsize=14)
plt.ylabel('Monthly Charges ($)', fontsize=14)
plt.show()
<ipython-input-48-06ed07957d72>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-bright')

The plot above shows that customers with high monthly charges tend to leave the service early and therefore accumulate little in total charges. Conversely, customers who stay on at reasonable monthly charges show a steep increase in total charges over time.

Correlation Matrix

In [49]:
plt.style.use('seaborn-pastel')

upper_triangle = np.triu(dataset.corr())
plt.figure(figsize=(20,10))
sns.heatmap(dataset.corr(), vmin=-1, vmax=1, annot=True, square=True, fmt='0.3f',
            annot_kws={'size':8}, cmap="hsv", mask=upper_triangle)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
<ipython-input-49-1c2f59d9824e>:1: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-pastel')

Positive correlation - A correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction together.

Negative correlation - A correlation of –1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down.
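The two extremes can be illustrated with a tiny numpy sketch (the data is made up): any exact linear relationship yields a correlation of ±1.

```python
import numpy as np

x = np.arange(10, dtype=float)
print(np.corrcoef(x, 2 * x + 1)[0, 1])   # +1.0: perfect positive correlation
print(np.corrcoef(x, -3 * x + 5)[0, 1])  # -1.0: perfect negative correlation
```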

In the heatmap above we can see the correlation details, and we can determine that there is no multicollinearity issue between our columns.

From the heatmap, we can observe several key correlations. For instance, Churn shows a low positive correlation with MonthlyCharges (0.193) and a negative correlation with TotalCharges (-0.227), suggesting that customers who have accumulated higher total charges (i.e. have stayed longer) churn less. The variable Dependents has a moderate positive correlation with Partner (0.453) and Contract (0.244), indicating that customers with dependents are more likely to have partners. The tenure variable exhibits a notable positive correlation with both MonthlyCharges (0.248) and TotalCharges (0.871), indicating that longer-tenured customers tend to pay more in monthly and total charges.

The heatmap provides a quick reference for identifying potential relationships among variables, which can inform further analysis and modeling efforts in understanding customer behavior and churn. ​

In [50]:
#Correlation Bar Plot comparing features with our label
plt.style.use('seaborn-white')

df_corr = dataset.corr()
plt.figure(figsize=(10,5))
df_corr['Churn'].sort_values(ascending=False).drop('Churn').plot.bar()
plt.title("Correlation of Features vs Churn Label\n", fontsize=16)
plt.xlabel("\nFeatures List", fontsize=14)
plt.ylabel("Correlation Value", fontsize=12)
plt.show()
<ipython-input-50-f21e060e643b>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-white')

Since the heatmap alone did not give a clear picture of which columns are positively or negatively correlated with the target, we generated this bar plot. It shows that MonthlyCharges, PaperlessBilling, SeniorCitizen, PaymentMethod, MultipleLines, and PhoneService are positively correlated with our target label Churn, while all the remaining features are negatively correlated with it.

KDE Plots

In [51]:
# Set the style
plt.style.use('seaborn-bright')

# Create a KDE plot for Monthly Charges
plt.figure(figsize=(12, 6))
sns.kdeplot(data=dataset, x='MonthlyCharges', hue='Churn', fill=True, common_norm=False, alpha=0.5)
plt.title('KDE Plot for Monthly Charges by Churn', fontsize=16)
plt.xlabel('Monthly Charges ($)', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.show()
<ipython-input-51-0c2a681e075f>:2: MatplotlibDeprecationWarning: The seaborn styles shipped by Matplotlib are deprecated since 3.6, as they no longer correspond to the styles shipped by seaborn. However, they will remain available as 'seaborn-v0_8-<style>'. Alternatively, directly use the seaborn API instead.
  plt.style.use('seaborn-bright')

The presented graph is a Kernel Density Estimate (KDE) plot illustrating the distribution of Monthly Charges in relation to customer Churn. The plot features two distinct curves: one for customers who did not churn (indicated by the blue area) and another for those who did churn (shown in green). The density values along the vertical axis represent the likelihood of different monthly charge amounts, with the x-axis showing the range of monthly charges. The overlapping areas highlight how the distribution of charges varies between churned and non-churned customers. Notably, the plot reveals that non-churned customers tend to have a higher density at lower monthly charges, while churned customers display a broader range, suggesting that higher monthly charges may correlate with a greater likelihood of customer churn. This visual representation is valuable for understanding the factors influencing customer retention and can inform strategies for improving customer loyalty.

In [52]:
# Create a KDE plot for Total Charges
plt.figure(figsize=(12, 6))
sns.kdeplot(data=dataset, x='TotalCharges', hue='Churn', fill=True, common_norm=False, alpha=0.5)
plt.title('KDE Plot for Total Charges by Churn', fontsize=16)
plt.xlabel('Total Charges ($)', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.show()

The displayed graph is a Kernel Density Estimate (KDE) plot showcasing the distribution of Total Charges categorized by customer Churn. The plot features two overlapping areas: one representing customers who did not churn (shown in blue) and the other representing those who did churn (in green). The y-axis indicates the density, reflecting the likelihood of different total charge amounts, while the x-axis outlines the range of total charges. This visualization reveals that non-churned customers tend to cluster around lower total charges, suggesting they may have incurred fewer costs over time. In contrast, churned customers display a broader distribution across the total charge range, indicating variability in their spending. The KDE plot provides valuable insights into customer behavior and can inform retention strategies by highlighting potential financial thresholds that may impact churn rates.

Violin Plot

In [53]:
# All feature columns (categorical features are already ordinally encoded, so violin plots apply)
numerical_columns = [
    'gender', 'SeniorCitizen', 'Partner', 'Dependents',
    'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
    'TotalCharges', 'Churn'
]

# Create violin plots for each feature
n_cols = 3
n_rows = (len(numerical_columns) + n_cols - 1) // n_cols
fig, ax = plt.subplots(ncols=n_cols, nrows=n_rows, figsize=(15, 4 * n_rows))
ax = ax.flatten()

for index, col in enumerate(numerical_columns):
    sns.violinplot(x=dataset[col], ax=ax[index], color='lightgreen')
    ax[index].set_title(f'Violin Plot of {col}')
    ax[index].set_xlabel(col)

# Hide any unused subplots
for i in range(len(numerical_columns), len(ax)):
    fig.delaxes(ax[i])

plt.tight_layout(pad=0.4, w_pad=0.4, h_pad=1.0)
plt.show()

The violin plots in the image show the distribution of values for each feature in the dataset. A violin plot combines aspects of a box plot and a kernel density plot. The wider areas represent where more data points are concentrated, while narrower sections indicate fewer data points.

Here’s a breakdown of some of the plots:

Gender, SeniorCitizen, Partner, PhoneService, and Dependents: These features appear to be binary (0 or 1), and the violin plots reflect this, showing two distinct regions at 0 and 1. The symmetry in these plots shows the distribution is fairly even between the two categories, though some, like "SeniorCitizen" and "Dependents," are more heavily skewed toward 0.

Tenure: The tenure plot shows a continuous distribution, with more values concentrated toward lower tenure (near 0) and gradually tapering off toward higher tenure values. This suggests that most customers in this dataset have been with the service provider for shorter periods.

Handling Class Imbalance Issue in our Label Column¶

In [54]:
X = dataset.drop(columns=['Churn'])
Y = dataset['Churn']

I have split the dataset into features and labels, where X holds all the feature columns and Y holds the target label column.

In [55]:
Y.value_counts()
Out[55]:
Churn
0    5174
1    1869
Name: count, dtype: int64

I have listed the value counts of the label column to see how many rows each category occupies. The counts reveal a substantial class imbalance (5,174 non-churned vs. 1,869 churned customers), which I will address using oversampling.

In [56]:
# Check for infinite values and print their indices
infinite_indices = np.where(np.isinf(X))
if infinite_indices[0].size > 0:
    print(f"Infinite values found in X at indices: {infinite_indices}")

# Check for NaN values
if np.any(np.isnan(X)):
    print("X contains NaN values.")
In [57]:
# Replace infinite values again with NaN if any exist
X = X.replace([np.inf, -np.inf], np.nan)

# Optionally, handle NaN values (impute or drop)
X.fillna(X.mean(), inplace=True)
In [58]:
# Final check before applying SMOTE
assert not np.any(np.isinf(X)), "X still contains infinite values."
assert not np.any(np.isnan(X)), "X still contains NaN values."

I checked for infinite values in the feature set X by using np.where(np.isinf(X)) to identify their indices and print a message if any were found. I also assessed whether X contained any NaN values and printed a notification if it did. To ensure the integrity of the dataset, I replaced any infinite values in X with NaN.

Subsequently, I handled the NaN values by filling them with the mean of their respective columns, which is a common imputation technique to maintain the dataset's structure. Finally, I conducted assertions to confirm that no infinite or NaN values remained in X, ensuring that the dataset was clean and ready for further processing.
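The replace-then-impute step can be sketched on a tiny hypothetical frame (made-up values, just to show the mechanics):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one infinite and one missing value
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [4.0, 5.0, np.nan]})

df = df.replace([np.inf, -np.inf], np.nan)  # infinities become NaN first
df = df.fillna(df.mean())                   # NaN filled with each column's mean

print(df["a"].tolist())  # [1.0, 2.0, 3.0] -> mean of 1 and 3 imputed
print(df["b"].tolist())  # [4.0, 5.0, 4.5]
```

Converting infinities to NaN first matters: `fillna` only targets NaN, so any leftover `inf` would silently survive and distort the column means.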

In [59]:
# Oversample the minority class so both label categories have equal counts
oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)  # directly overwrite X and Y

SMOTE (Synthetic Minority Over-sampling Technique) is the oversampling method I am using to ensure that all the categories present in the target label have the same number of samples.
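Rather than duplicating minority rows, SMOTE synthesizes new ones by interpolating between a minority sample and one of its k nearest minority neighbours. A minimal numpy-only sketch of that core idea (not imblearn's actual implementation, and using made-up data):

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Sketch of SMOTE's core idea: each synthetic point lies on the line
    segment between a minority sample and one of its k nearest minority
    neighbours. Illustrative only, not the imblearn implementation."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point
    new = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority sample
        j = neighbours[i, rng.integers(k)]      # pick one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        new.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
synthetic = smote_like_samples(X_min, n_new=30)
print(synthetic.shape)  # (30, 2)
```

Because each synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.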

In [60]:
print("Oversampling complete. Shape of X:", X.shape)
print("New distribution of Y after SMOTE:\n", pd.Series(Y).value_counts())
Oversampling complete. Shape of X: (10348, 19)
New distribution of Y after SMOTE:
 Churn
0    5174
1    5174
Name: count, dtype: int64

After applying oversampling, I have listed the values of the label column once again to verify the updated distribution.

Here we see that the class imbalance problem has been resolved: both categories now have the same number of samples, ensuring that the machine learning model does not become biased towards one category.

In [74]:
dataset.head()
Out[74]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0.0 0.0 1.0 0.0 0.013889 0.0 1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.000000 1.0 2.0 0.115423 0.058576 0
1 1.0 0.0 0.0 0.0 0.472222 1.0 0.0 0.0 2.0 0.0 2.0 0.0 0.0 0.0 0.693147 0.0 3.0 0.385075 0.443680 0
2 1.0 0.0 0.0 0.0 0.027778 1.0 0.0 0.0 2.0 2.0 0.0 0.0 0.0 0.0 0.000000 1.0 3.0 0.354229 0.111247 1
3 1.0 0.0 0.0 0.0 0.625000 0.0 1.0 0.0 2.0 0.0 2.0 2.0 0.0 0.0 0.693147 0.0 0.0 0.239303 0.438442 0
4 0.0 0.0 0.0 0.0 0.027778 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 1.0 2.0 0.521891 0.131571 1

Data Splitting¶

In [61]:
# Split the data into training (70%) and testing (30%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

Data Splitting: I am splitting my dataset into training (70%) and testing (30%) sets, which is a crucial step in preparing data for machine learning.

Training Set: This portion of the data is used to train your machine learning model. It allows the model to learn the underlying patterns and relationships in the data.

Testing Set: This portion is kept separate and is used to evaluate the performance of your model after training. It helps assess how well the model generalizes to unseen data.
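As a side note, passing `stratify` to `train_test_split` keeps the class proportions identical in both splits, which is useful when working with imbalanced labels. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 negatives and 20 positives (a 20% positive rate)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=10, stratify=y_demo
)
print(y_te.mean())  # 0.2 -> the 20% positive rate is preserved in the test set
```

After SMOTE the classes are already balanced, so a plain random split (as used here) behaves similarly, but stratification makes the guarantee explicit.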

04. Model Building for Classification with Evaluation Metrics¶

In [62]:
def classify(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Training the model
    model.fit(X_train, Y_train)

    # Predicting Y_test
    pred = model.predict(X_test)

    # Classification Report
    class_report = classification_report(Y_test, pred)
    print("\nClassification Report:\n", class_report)

    # Accuracy Score
    acc_score = accuracy_score(Y_test, pred) * 100
    print("Accuracy Score:", acc_score)

    # Precision, Recall, F1 Score
    precision = precision_score(Y_test, pred, average='weighted') * 100
    recall = recall_score(Y_test, pred, average='weighted') * 100
    f1 = f1_score(Y_test, pred, average='weighted') * 100

    print("Precision Score:", precision)
    print("Recall Score:", recall)
    print("F1 Score:", f1)

    # Cross Validation Score
    cv_score = cross_val_score(model, X, Y, cv=5).mean() * 100
    print("Cross Validation Score:", cv_score)

    # Result of accuracy minus cv scores
    result = acc_score - cv_score
    print("\nAccuracy Score - Cross Validation Score is", result)

I have defined a function classify that implements a classification model training and evaluation process.

  1. Data Splitting: I split the dataset into training (70%) and testing (30%) sets, which is a crucial step in preparing data for machine learning.

Training Set: This portion of the data is used to train the machine learning model. It allows the model to learn the underlying patterns and relationships in the data.

Testing Set: This portion is kept separate and is used to evaluate the performance of the model after training. It helps assess how well the model generalizes to unseen data.

  2. Model Training: Next, the model is trained using the training data (X_train and Y_train) by calling the fit method. This step adjusts the model parameters based on the input features and corresponding labels.

  3. Prediction: After training, the model makes predictions on the test set (X_test) by calling the predict method. The predicted labels are stored in the variable pred.

  4. Classification Report: The function generates a detailed classification report using classification_report, which includes metrics such as precision, recall, and F1-score. This report provides insights into the model's performance for each class, as well as overall metrics.

  5. Accuracy Score: The accuracy of the model is calculated using accuracy_score, which gives the proportion of correctly predicted instances out of the total instances in the test set. This value is multiplied by 100 to express it as a percentage.

  6. Precision, Recall, and F1 Score: The function computes precision, recall, and F1 scores using precision_score, recall_score, and f1_score, respectively. These metrics provide a more nuanced understanding of model performance, especially in imbalanced datasets. The results are also presented as percentages.

  7. Cross-Validation Score: To assess the model's stability and generalization capability, the function uses cross_val_score with 5-fold cross-validation. This step evaluates the model's performance on multiple subsets of the data, and the mean score is calculated and presented.

  8. Performance Comparison: Finally, the function calculates the difference between the accuracy score and the cross-validation score. This comparison helps to identify if the model is overfitting or if there are significant differences in performance between the training and validation phases.

Below, I have applied different data mining algorithms to model the dataset. For each algorithm, I trained the model, predicted Y, and evaluated the model using metrics such as Accuracy, Precision, Recall, F1 Score, Cross-Validation Score, and the performance comparison.

In [ ]:
# Logistic Regression

model = LogisticRegression()
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.75      0.77      1581
           1       0.75      0.80      0.78      1524

    accuracy                           0.77      3105
   macro avg       0.77      0.77      0.77      3105
weighted avg       0.78      0.77      0.77      3105

Accuracy Score: 77.39130434782608
Precision Score: 77.53198109154162
Recall Score: 77.39130434782608
F1 Score: 77.3812186941038
Cross Validation Score: 76.93299990893871

Accuracy Score - Cross Validation Score is 0.4583044388873674

Logistic Regression is a statistical method used for binary classification problems, where the goal is to predict the probability of a binary outcome based on one or more predictor variables. Unlike linear regression, which outputs continuous values, logistic regression applies the logistic function to convert the predicted values into probabilities ranging from 0 to 1. This is particularly useful for classification tasks, such as predicting customer churn, where outcomes can be represented as "churned" or "not churned."
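The logistic (sigmoid) function mentioned above maps any real-valued linear score to a probability in (0, 1); a score of zero corresponds to the 0.5 decision boundary:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))        # 0.5 -> exactly on the decision boundary
print(sigmoid(3.0) > 0.9)  # True: a large positive score -> confident "churn"
```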

In assessing the performance of the Logistic Regression model, various evaluation metrics are employed, including precision, recall, F1-score, and accuracy, as outlined in the provided classification report:

Precision measures the proportion of true positive predictions among all positive predictions. In this case, the precision for the "not churned" class (0) is 0.80, indicating that 80% of the customers predicted not to churn actually did not churn. The precision for the "churned" class (1) is 0.75, meaning that 75% of those predicted to churn actually churned. High precision is crucial in scenarios where false positives are costly.

Recall evaluates the model's ability to identify all relevant instances within the positive class. The recall for the non-churned class is 0.75, while for the churned class, it is 0.80. This indicates that the model effectively identifies 80% of actual churners, highlighting its strength in minimizing false negatives, which is particularly important for proactive retention strategies.

F1 Score combines precision and recall into a single metric by calculating their harmonic mean. The F1 scores for both classes are relatively balanced around 0.77, indicating a good trade-off between precision and recall. This score is especially useful when dealing with imbalanced datasets, as it provides a more comprehensive measure of model performance than accuracy alone.
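Using the class-0 numbers from the report above, the harmonic mean can be verified by hand:

```python
# Class 0 from the logistic regression report: precision 0.80, recall 0.75
precision, recall = 0.80, 0.75
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.77 -> matches the f1-score in the classification report
```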

Accuracy reflects the overall proportion of correct predictions made by the model. With an accuracy score of 77.39%, the model demonstrates solid performance. However, it’s important to contextualize accuracy, particularly in imbalanced datasets, where a high accuracy may not indicate effective classification.

Cross Validation Score (76.93%) further confirms the model's reliability by evaluating its performance across multiple subsets of the data. The small difference between the accuracy score and the cross-validation score (approximately 0.458%) suggests that the model's performance is stable and not overly sensitive to the specific training data used.

In [ ]:
# Support Vector Classifier

from sklearn.svm import SVC

model = SVC(C=1.0, kernel='rbf', gamma='auto', random_state=42)
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.74      0.78      1581
           1       0.75      0.82      0.79      1524

    accuracy                           0.78      3105
   macro avg       0.78      0.78      0.78      3105
weighted avg       0.78      0.78      0.78      3105

Accuracy Score: 78.09983896940419
Precision Score: 78.33821759467632
Recall Score: 78.09983896940419
F1 Score: 78.07733740286798
Cross Validation Score: 77.80277993756467

Accuracy Score - Cross Validation Score is 0.29705903183952387

Support Vector Classifier (SVC) is a type of supervised machine learning algorithm that is particularly effective for classification tasks. It works by finding the optimal hyperplane that best separates the data points of different classes in a high-dimensional space. The main idea is to maximize the margin between the closest data points of each class (support vectors), thus ensuring that the classifier generalizes well to unseen data.
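The high-dimensional separation comes from the RBF kernel used here (`kernel='rbf'`), which scores similarity between two points as a function that decays exponentially with their squared distance. A minimal sketch of that kernel:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """The RBF kernel behind SVC(kernel='rbf'): similarity decays
    exponentially with squared Euclidean distance."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))         # 1.0 -> identical points
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]) < 1e-6)  # True -> distant points
```

Larger `gamma` values make the similarity drop off faster, which produces more flexible (and more overfitting-prone) decision boundaries.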

The performance of the Support Vector Classifier can be evaluated using several key metrics, as provided in the classification report:

Precision measures the accuracy of the positive predictions made by the classifier. For the "not churned" class (0), the precision is 0.81, indicating that 81% of the customers predicted not to churn actually did not churn. The precision for the "churned" class (1) is 0.75, meaning that 75% of the predicted churners were indeed churners. A high precision is desirable, especially in contexts where false positives have significant repercussions.

Recall assesses the classifier's ability to identify all relevant instances within the positive class. The recall for the non-churned class is 0.74, while for the churned class, it is 0.82. This indicates that the classifier successfully identifies 82% of actual churners, showcasing its effectiveness in minimizing false negatives, which is crucial for strategies aimed at customer retention.

F1 Score provides a balanced measure by combining precision and recall into a single score, calculated as their harmonic mean. The F1 scores for both classes hover around 0.78 and 0.79, indicating a good balance between precision and recall. This is especially useful in imbalanced datasets where it’s vital to consider both false positives and false negatives.

Accuracy reflects the overall proportion of correct predictions made by the model. With an accuracy score of 78.1%, the SVC model shows solid performance, correctly predicting a significant portion of the instances. Like with logistic regression, it's essential to interpret accuracy carefully, particularly in cases of class imbalance.

Cross Validation Score (77.80%) enhances the reliability of the model's performance assessment by evaluating it across multiple training/test splits. The difference between the accuracy score and cross-validation score (approximately 0.29%) suggests that the model's performance is consistent and not overly dependent on the specific training data used.

In [ ]:
# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=21, max_depth=15)
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.76      0.78      1581
           1       0.77      0.81      0.79      1524

    accuracy                           0.79      3105
   macro avg       0.79      0.79      0.79      3105
weighted avg       0.79      0.79      0.79      3105

Accuracy Score: 78.55072463768116
Precision Score: 78.67946162382069
Recall Score: 78.55072463768116
F1 Score: 78.5429857121728
Cross Validation Score: 79.28192807092506

Accuracy Score - Cross Validation Score is -0.7312034332439055

A Decision Tree Classifier is a powerful, non-parametric machine learning algorithm that works by recursively splitting the dataset into subsets based on the value of input features. Each node in the tree represents a feature, and the branches represent possible outcomes, leading to decision rules that ultimately predict a class label at the leaf nodes. Decision trees are also capable of handling both numerical and categorical data, making them versatile for many applications.
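The splits are chosen to reduce node impurity; Gini impurity, the default criterion in scikit-learn's DecisionTreeClassifier, can be computed directly:

```python
import numpy as np

def gini(labels):
    """Gini impurity, scikit-learn's default split criterion:
    1 - sum(p_k^2) over the class proportions p_k at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini([0, 0, 0, 0]))  # 0.0 -> a pure node (nothing left to split)
print(gini([0, 0, 1, 1]))  # 0.5 -> a maximally mixed binary node
```

At each node the tree picks the feature and threshold whose split yields the largest impurity decrease.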

The performance of the Decision Tree Classifier can be assessed using the following metrics from the classification report:

Precision: Precision quantifies how many of the predicted positive cases were actually positive. In this case, the precision for the non-churned class (0) is 0.81, and for the churned class (1), it is 0.77. This suggests that 81% of the customers predicted not to churn did indeed not churn, while 77% of those predicted to churn actually churned. The precision values are reasonably high, meaning the model is making relatively few false positive errors.

Recall: Recall, or sensitivity, measures how well the classifier identifies all the relevant cases in the dataset. For the non-churned class (0), the recall is 0.76, and for the churned class (1), it is 0.81. This indicates that the model successfully captures 81% of all actual churners. High recall is crucial when it’s important to minimize false negatives, such as in customer churn predictions, where failing to identify potential churners can be costly.

F1 Score: The F1 score is the harmonic mean of precision and recall. It provides a more balanced view of performance, particularly when there’s an uneven class distribution. Both classes have similar F1 scores (0.79), indicating a balanced performance with a good trade-off between precision and recall.

Accuracy: The accuracy of the decision tree model is 78.55%, meaning that the model correctly predicts the churn status of about 79% of all customers. While accuracy is useful, it doesn’t tell the whole story, especially if there is an imbalance in the dataset. It is important to consider precision and recall along with accuracy for a holistic view of the model's performance.

Cross-Validation Score: The cross-validation score is 79.28%, which suggests that the model's performance is stable across different data splits. The difference between the accuracy score and cross-validation score is minimal (-0.73%), meaning that the decision tree model generalizes well to new data and does not overfit the training set.

In [ ]:
# Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=15, random_state=111)
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.83      0.85      1581
           1       0.83      0.88      0.85      1524

    accuracy                           0.85      3105
   macro avg       0.85      0.85      0.85      3105
weighted avg       0.85      0.85      0.85      3105

Accuracy Score: 85.18518518518519
Precision Score: 85.29784119720789
Recall Score: 85.18518518518519
F1 Score: 85.18203486254605
Cross Validation Score: 84.89661742352604

Accuracy Score - Cross Validation Score is 0.2885677616591522

The Random Forest Classifier is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode (most frequent) of the classes predicted by individual trees. It improves upon decision trees by reducing overfitting and improving generalization through a combination of several weak learners (decision trees) that each contribute to the final decision. Each tree in the random forest is trained on a different random subset of the data, and only a random subset of features is considered for splitting at each node, adding diversity to the model. This ensemble technique increases robustness and predictive power.
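The mode-of-votes idea can be sketched with hypothetical per-tree predictions (scikit-learn's RandomForestClassifier actually averages predicted probabilities, but the effect is similar for hard votes):

```python
import numpy as np

# Hypothetical hard predictions from 5 trees for 4 samples (one row per tree)
tree_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
])

# Majority vote per sample: class 1 wins when more than half the trees say 1
majority = (tree_preds.mean(axis=0) >= 0.5).astype(int)
print(majority)  # [0 1 1 0]
```

Even if individual trees disagree on some samples, the aggregated vote smooths out their individual errors, which is where the variance reduction comes from.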

From the classification report and performance scores, the Random Forest Classifier's results are evaluated as follows:

Precision: Precision is 0.87 for the non-churned class (0) and 0.83 for the churned class (1), indicating that the model makes few false positives. In this case, 87% of the customers predicted to not churn indeed did not, while 83% of those predicted to churn were actual churners. High precision is valuable when false positives are costly to the business.

Recall: Recall is slightly higher for the churned class (1) at 0.88 compared to the non-churned class (0) at 0.83. This means the model is effectively identifying customers likely to churn, capturing 88% of all actual churners. High recall is essential when the objective is to capture as many potential churners as possible.

F1 Score: The F1 score is balanced between the classes, with both having a value of 0.85. The F1 score balances precision and recall, making it a good measure when class distribution is reasonably balanced, as it shows a strong performance in both minimizing false positives and capturing most actual churners.

Accuracy: The overall accuracy is 85.18%, meaning the model correctly predicts the churn status for nearly 85% of all customers. This is a significant improvement over the simpler models (e.g., decision trees), indicating the Random Forest model's ability to generalize well on unseen data.

Cross-Validation Score: The cross-validation score of 84.89% suggests that the model performs consistently across different data splits, with a very small difference from the accuracy score (0.29%). This consistency indicates that the Random Forest model is stable and not prone to overfitting.

In [ ]:
# K Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=15)
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.63      0.71      1581
           1       0.69      0.87      0.77      1524

    accuracy                           0.75      3105
   macro avg       0.76      0.75      0.74      3105
weighted avg       0.76      0.75      0.74      3105

Accuracy Score: 74.58937198067633
Precision Score: 76.40076761350775
Recall Score: 74.58937198067633
F1 Score: 74.2332629290607
Cross Validation Score: 76.35310764144269

Accuracy Score - Cross Validation Score is -1.7637356607663577

The K-Nearest Neighbors (KNN) classifier is a simple, instance-based learning algorithm where classification is determined based on the majority class among the K-nearest data points in the feature space. The "distance" to neighbors is typically measured using Euclidean distance, and the algorithm is non-parametric, meaning it makes no assumptions about the underlying data distribution. KNN can be effective when data is well-distributed, but its performance can degrade with imbalanced or high-dimensional data, and it can be computationally expensive with large datasets.
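The majority-among-neighbours rule is simple enough to sketch in a few lines (a toy illustration with made-up points, not scikit-learn's optimized implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Minimal KNN for binary labels: the predicted class is the majority
    class among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(d)[:k]               # indices of the k closest points
    votes = y_train[nearest]
    return int(votes.mean() >= 0.5)           # majority vote over binary labels

# Made-up training data: class 0 near the origin, class 1 near (5, 5.5)
X_tr = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_tr = np.array([0, 0, 1, 1])

print(knn_predict(X_tr, y_tr, np.array([0.2, 0.3]), k=3))  # 0 -> near class 0
print(knn_predict(X_tr, y_tr, np.array([5.0, 5.5]), k=3))  # 1 -> near class 1
```

An odd `k` avoids exact ties; in practice `n_neighbors` (15 above) is tuned, since small values overfit local noise and large values oversmooth the boundary.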

Based on the classification report and performance scores, the KNN classifier's performance is as follows:

Precision: The precision for the non-churned class (0) is 0.83, while it is lower for the churned class (1) at 0.69. This indicates that the model makes more false positives when predicting customers will churn. While 83% of the non-churned predictions are correct, only 69% of churn predictions are accurate. A lower precision for the churned class could lead to unnecessary retention efforts for non-churners.

Recall: The recall for the churned class (1) is higher at 0.87 compared to 0.63 for the non-churned class (0). This means the model is good at identifying actual churners (capturing 87% of churners), but it struggles with identifying non-churners, missing 37% of them. This is concerning in situations where the model needs to identify non-churners accurately.

F1 Score: The F1 score, which balances precision and recall, is 0.71 for the non-churned class and 0.77 for the churned class. This shows a somewhat imbalanced performance between classes, with better handling of churners but weaker performance for non-churners.

Accuracy: The overall accuracy is 74.58%, meaning that the model correctly classifies about 75% of customers. While acceptable, this is lower compared to other models like Random Forest or Decision Tree. It indicates that the KNN classifier may not be the best choice for this dataset.

Cross-Validation Score: The cross-validation score of 76.35% is slightly higher than the accuracy score, indicating that the model's performance is fairly consistent across different subsets of the data. However, the difference of -1.76% suggests the model might be slightly underfitting the training data.

In [63]:
# Extra Trees Classifier

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.84      0.85      1581
           1       0.84      0.86      0.85      1524

    accuracy                           0.85      3105
   macro avg       0.85      0.85      0.85      3105
weighted avg       0.85      0.85      0.85      3105

Accuracy Score: 85.15297906602255
Precision Score: 85.1794420298113
Recall Score: 85.15297906602255
F1 Score: 85.15411870121126
Cross Validation Score: 86.39438408715733

Accuracy Score - Cross Validation Score is -1.2414050211347814

The Extra Trees Classifier (Extremely Randomized Trees) is an ensemble learning method that builds multiple decision trees and aggregates their results for improved predictive performance. Unlike Random Forest, where a random subset of features is used to split at each node, Extra Trees introduces additional randomness by selecting cut-points for splits randomly, which helps to reduce variance and overfitting. This makes it faster and sometimes more accurate, depending on the dataset. Extra Trees also performs well on large datasets with high-dimensional features and can manage imbalanced datasets effectively.

Based on the classification report and performance scores, the Extra Trees Classifier provides the following results:

Precision: The precision is high for both the non-churned class (0) and the churned class (1) at around 0.86 and 0.84, respectively. This indicates that the model is good at predicting both classes with relatively few false positives. It shows balanced precision, making it a robust classifier for both classes.

Recall: The recall is also balanced at 0.84 for the non-churned class and 0.86 for the churned class, showing that the model can correctly identify most of the actual churners and non-churners. The balanced recall makes it a well-rounded model that performs well across both classes without much bias.

F1 Score: The F1 score, which combines both precision and recall, is around 0.85 for both classes. This balance between precision and recall indicates that the model is well-suited for classifying both churners and non-churners. The F1 score reflects that the model has consistent predictive performance.

Accuracy: The overall accuracy of the model is 85.15%, meaning that about 85% of the customers are classified correctly. This is a solid performance and close to the best-performing models like Random Forest. It shows that the Extra Trees Classifier is effective for this dataset.

Cross-Validation Score: The cross-validation score of 86.39% is slightly higher than the model's accuracy score, which indicates a well-generalized model. However, the difference of -1.24% (accuracy score minus cross-validation score) suggests the model slightly underperforms on unseen data compared to the validation sets, but the difference is relatively small.

In [ ]:
# XGB Classifier

import xgboost as xgb

model = xgb.XGBClassifier(verbosity=0)
classify(model, X, Y)
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.85      0.85      1581
           1       0.84      0.83      0.84      1524

    accuracy                           0.84      3105
   macro avg       0.84      0.84      0.84      3105
weighted avg       0.84      0.84      0.84      3105

Accuracy Score: 84.2512077294686
Precision Score: 84.25214964557422
Recall Score: 84.2512077294686
F1 Score: 84.24915159538881
Cross Validation Score: 83.59233964458079

Accuracy Score - Cross Validation Score is 0.6588680848878141

The XGBoost Classifier (Extreme Gradient Boosting) is a powerful, efficient implementation of gradient boosting, widely used in predictive modeling tasks due to its performance, speed, and handling of overfitting. It builds an ensemble of weak learners (typically decision trees), focusing on correcting errors from previous iterations by assigning higher weights to misclassified instances. XGBoost supports regularization (L1 and L2), which helps prevent overfitting, and is highly effective in both classification and regression problems, particularly with imbalanced or large datasets.
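The additive error-correcting idea behind boosting can be illustrated with a toy numeric example (not the actual XGBoost algorithm, which fits regularized trees to gradients): each stage fits the residual left by the current ensemble, scaled by a learning rate.

```python
import numpy as np

# Made-up regression targets, just to show the additive mechanism
y = np.array([3.0, -1.0, 2.0])
pred = np.zeros_like(y)   # the ensemble starts from a trivial prediction
lr = 0.5                  # learning rate (shrinkage)

for _ in range(20):
    residual = y - pred   # errors of the current ensemble
    pred += lr * residual # an idealized weak learner that fits the residual

print(np.round(pred, 3))  # [ 3. -1.  2.] -> converges toward the targets
```

Each stage only removes a fraction `lr` of the remaining error, so after `n` stages the residual shrinks by a factor of (1 - lr)^n; the shrinkage is what keeps individual weak learners from dominating and helps control overfitting.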

Based on the classification report and performance scores, the XGBoost Classifier yields the following results:

Precision: Precision is approximately 0.84 for the non-churned class (0) and 0.84 for the churned class (1). This indicates that the model predicts both classes well, with relatively few false positives. It slightly favors the non-churn class but performs consistently across both.

Recall: The recall is balanced, at 0.85 for the non-churned class and 0.83 for the churned class. This shows that the model correctly identifies most of the true churners and non-churners, minimizing false negatives and making it reliable for capturing both customer churn and retention.

F1 Score: The F1 score, which balances precision and recall, is 0.85 for class 0 and 0.84 for class 1. This indicates that the model has a consistent predictive ability without bias towards either class. The balance between precision and recall confirms that the XGBoost classifier handles both false positives and false negatives effectively.

Accuracy: The overall accuracy of the model is 84.25%, indicating that the model correctly classifies about 84% of the customers. This is a strong performance and comparable to other top models like Random Forest and Extra Trees.

Cross-Validation Score: The cross-validation score is 83.59%, very close to the model's accuracy score. The difference of only 0.65% (accuracy score minus cross-validation score) suggests the model generalizes well on unseen data. This tiny gap between the two scores indicates the model is not overfitting and is robust when applied to different data subsets.

Evaluation of the Models

After evaluating various classification models on the customer churn dataset, the following summarizes the performance of each algorithm:

Logistic Regression: The Logistic Regression model demonstrates balanced performance, with an accuracy score of 77.39% and a cross-validation (CV) score of 76.93%. The difference between these scores is minimal (0.46%), suggesting that the model generalizes well and avoids overfitting. However, the performance metrics (Precision, Recall, and F1 Score) are all close to 77%, indicating that while it provides a reasonable balance between precision and recall, there is room for improvement in model complexity to achieve better accuracy without sacrificing generalizability.

Support Vector Classifier (SVC): The SVC model improves slightly on the Logistic Regression with an accuracy of 78.10% and a cross-validation score of 77.80%, resulting in a small difference of 0.30%. This indicates that the SVC also generalizes well across different data samples, with marginally better performance than Logistic Regression. The Precision, Recall, and F1 Score being around 78% further show that the SVC is a consistent performer, but like Logistic Regression, it does not overfit.

Decision Tree Classifier: The Decision Tree Classifier achieves an accuracy of 78.55%, slightly higher than both Logistic Regression and SVC, but its cross-validation score of 79.28% exceeds its accuracy score, giving a negative difference of -0.73%. This suggests the model may slightly underfit, as its cross-validated performance appears better than its held-out test performance.

Random Forest Classifier: The Random Forest model is one of the top performers with an accuracy of 85.19% and a cross-validation score of 84.90%, with only a slight difference of 0.29%. This indicates a robust and well-generalized model with minimal overfitting. Its high F1 Score (85.18%) suggests that it handles both precision and recall effectively, making it a strong choice for predictive tasks.

K-Neighbors Classifier (KNN): The KNN model has a lower accuracy score of 74.59%, and its cross-validation score is actually higher at 76.35%, resulting in a difference of -1.76%. This negative difference indicates underfitting, meaning the model is too simplistic and struggles to capture patterns in the data. It performs poorly compared to other models in terms of both accuracy and generalizability.

Extra Trees Classifier: The Extra Trees model is another top performer, with an accuracy score of 85.12% and a cross-validation score of 86.23%, with a negative difference of -1.11%. This suggests the model may underfit slightly, but still performs well on unseen data. The consistent high F1 Score (85.12%) confirms that the model effectively balances precision and recall, making it a reliable option for prediction tasks.

XGBoost Classifier: XGBoost has an accuracy score of 84.25% and a cross-validation score of 83.59%, with a moderate difference of 0.66%. This indicates that the model generalizes well, but slightly overfits to the training data compared to models like Random Forest. However, its high precision, recall, and F1 scores of around 84% show that XGBoost is a solid model, though not the top performer among the classifiers tested.

Overall Evaluation

From these results, the Random Forest Classifier and Extra Trees Classifier stand out as top performers due to their high accuracy, precision, and recall, while also maintaining minimal overfitting or underfitting. XGBoost performs well but may exhibit slight overfitting. Logistic Regression and SVC provide reasonable, balanced results and are good models to start with due to their simplicity. Decision Tree shows potential but may benefit from hyperparameter tuning, while KNN clearly underperforms, likely due to its sensitivity to noisy data.
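To make the comparison above easier to scan, the reported scores can be collected into a single table. A minimal sketch using pandas, with the accuracy and cross-validation values transcribed from the discussion above:

```python
import pandas as pd

# Scores (in %) transcribed from the per-model evaluation above
scores = pd.DataFrame({
    "Model": ["Logistic Regression", "SVC", "Decision Tree", "Random Forest",
              "KNN", "Extra Trees", "XGBoost"],
    "Accuracy": [77.39, 78.10, 78.55, 85.19, 74.59, 85.12, 84.25],
    "CV Score": [76.93, 77.80, 79.28, 84.90, 76.35, 86.23, 83.59],
})
# Positive values suggest mild overfitting; negative values suggest underfitting
scores["Acc - CV"] = (scores["Accuracy"] - scores["CV Score"]).round(2)

print(scores.sort_values("Accuracy", ascending=False).to_string(index=False))
```

Sorting by accuracy places Random Forest and Extra Trees at the top, matching the conclusion drawn above.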

05. Model Optimization¶

Hyperparameter Tuning and Cross-Validation¶

I have optimized the Random Forest Classifier and the Extra Trees Classifier using hyperparameter tuning and cross-validation.

Hyperparameter Tuning

Hyperparameter tuning refers to the process of optimizing the parameters that govern the behavior of a machine learning model. Unlike model parameters, which are learned during training (e.g., the weights in a neural network), hyperparameters are set before training begins and control aspects like model complexity, learning rate, or the number of decision trees in a random forest. Tuning these values can significantly affect a model's performance. Common techniques include Grid Search and Random Search, where different combinations of hyperparameters are tested to find the best-performing configuration.
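The notebook below uses Grid Search, but the Random Search variant mentioned here samples a fixed number of configurations from distributions instead of exhausting every combination, which is much cheaper on large grids. A minimal sketch using scikit-learn's RandomizedSearchCV on synthetic stand-in data (the real churn features X and Y are not used here):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the churn features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions to sample from, rather than a fixed grid of values
param_dist = {
    "n_estimators": randint(100, 400),
    "max_depth": randint(5, 25),
    "min_samples_split": randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,          # only 5 sampled configurations, vs. the full grid
    cv=3,
    random_state=0,
    n_jobs=-1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With `n_iter=5` and `cv=3` this fits only 15 models, whereas a full grid over the same ranges would require hundreds of fits.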

Cross-Validation

Cross-validation is a technique used to assess the generalizability of a machine learning model by splitting the dataset into multiple subsets. One of the most commonly used methods is k-fold cross-validation, where the data is divided into k equally sized "folds." The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold serving as the test set once. The results are averaged to provide a more robust estimate of model performance, helping to reduce overfitting or underfitting.

Together, hyperparameter tuning and cross-validation allow for the development of models that are both well-optimized and capable of generalizing effectively to new data.

Hyperparameter Tuning and Cross-Validation for the Random Forest Model

In [ ]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Define a function to classify and evaluate the model
def classify_with_tuning(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Hyperparameter tuning using Grid Search
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 15, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4],
        'bootstrap': [True, False]
    }

    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

    # Fit the grid search model
    grid_search.fit(X_train, Y_train)

    # Best parameters after tuning
    print("Best Hyperparameters after tuning:\n", grid_search.best_params_)

    # Train the model with the best parameters
    best_model = grid_search.best_estimator_

    # Predict Y_test
    pred = best_model.predict(X_test)

    # Classification Report
    class_report = classification_report(Y_test, pred)
    print("\nClassification Report:\n", class_report)

    # Accuracy Score
    acc_score = accuracy_score(Y_test, pred) * 100
    print("Accuracy Score:", acc_score)

    # Precision, Recall, F1 Score
    precision = precision_score(Y_test, pred, average='weighted') * 100
    recall = recall_score(Y_test, pred, average='weighted') * 100
    f1 = f1_score(Y_test, pred, average='weighted') * 100

    print("Precision Score:", precision)
    print("Recall Score:", recall)
    print("F1 Score:", f1)

    # Cross Validation Score
    cv_score = cross_val_score(best_model, X, Y, cv=5).mean() * 100
    print("Cross Validation Score:", cv_score)

    # Result of accuracy minus cv scores
    result = acc_score - cv_score
    print("\nAccuracy Score - Cross Validation Score is", result)

# Initialize Random Forest Classifier
model = RandomForestClassifier(random_state=111)

# Call the function to classify and tune
classify_with_tuning(model, X, Y)
Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best Hyperparameters after tuning:
 {'bootstrap': False, 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.83      0.85      1581
           1       0.83      0.87      0.85      1524

    accuracy                           0.85      3105
   macro avg       0.85      0.85      0.85      3105
weighted avg       0.85      0.85      0.85      3105

Accuracy Score: 84.70209339774557
Precision Score: 84.7831773065131
Recall Score: 84.70209339774557
F1 Score: 84.7007795482471
Cross Validation Score: 85.1672935885851

Accuracy Score - Cross Validation Score is -0.4652001908395249

In the above code, I implemented a classification model using the Random Forest algorithm and performed hyperparameter tuning to enhance its performance. Hyperparameter tuning was executed through Grid Search, where I defined a parameter grid that included various options for the number of estimators, maximum depth, minimum samples required to split a node, minimum samples required at a leaf node, and whether to use bootstrap sampling. By fitting the model with different combinations of these parameters, I identified the best-performing set of hyperparameters that optimized the model's predictive capability.

To ensure the robustness and reliability of the model, I employed k-fold cross-validation. This technique involves splitting the dataset into multiple subsets (or folds), training the model on a portion of the data while validating it on the remaining part. By averaging the performance across all folds, I obtained a more reliable estimate of the model's accuracy and reduced the risk of overfitting.

Splitting the Data: The dataset is divided into 5 equal-sized (or nearly equal-sized) subsets (folds).

Training and Validation: For each fold, the model is trained on 4 of the folds and validated on the remaining fold. This process is repeated 5 times, with each fold serving as the validation set once.

Aggregating Results: After training and validating across all folds, the performance metrics (e.g., accuracy) from each fold are averaged to provide a more reliable estimate of the model's performance.
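The three steps above can be written out explicitly with scikit-learn's KFold; this sketch uses synthetic stand-in data rather than the churn features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data standing in for the churn features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # Step 1: split into 5 folds
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])    # Step 2: train on 4 folds
    score = accuracy_score(y_demo[test_idx],
                           model.predict(X_demo[test_idx]))  # validate on the 5th
    fold_scores.append(score)
    print(f"Fold {fold}: {score:.3f}")

print("Mean CV accuracy:", np.mean(fold_scores).round(3))   # Step 3: aggregate
```

The mean printed at the end is exactly what `cross_val_score(...).mean()` computes in the tuning code below.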

The final results included various performance metrics such as accuracy, precision, recall, F1 score, and cross-validation scores, which helped me assess the effectiveness of the model comprehensively. Overall, these steps aimed to enhance the model's predictive power while ensuring its generalizability to unseen data.

The output presents the results of tuning a Random Forest Classifier model to predict customer churn. After conducting hyperparameter tuning, the best hyperparameters were identified as follows: bootstrap set to False, max_depth at 15, min_samples_leaf at 1, min_samples_split at 5, and n_estimators at 300. These parameters were selected to optimize the model's performance and enhance its ability to generalize to unseen data.

bootstrap (False):

This parameter determines whether bootstrap samples are used when building trees. When set to True, each tree in the forest is trained on a random sample of the data, which is obtained with replacement. Setting it to False means that the model uses the entire dataset to build each tree, which can lead to more stable trees but may also result in overfitting if the model becomes too complex.

max_depth (15):

This parameter specifies the maximum depth of each tree in the forest. Limiting the depth helps prevent overfitting, where the model learns noise in the training data rather than the underlying patterns. A max depth of 15 means that no tree in the forest will be allowed to grow beyond 15 levels deep.

min_samples_leaf (1):

This parameter sets the minimum number of samples that must be present in a leaf node. A value of 1 means a leaf node can be created with as little as a single sample. This is the least restrictive setting: it lets the trees capture fine-grained patterns, but on its own it offers no protection against overfitting, so other parameters such as max_depth carry that role here.

min_samples_split (5):

This parameter defines the minimum number of samples required to split an internal node. A value of 5 means that a node will be split only if at least five samples are present. This is a common setting that allows the trees to grow as long as there are sufficient data points, but it can lead to deeper trees when combined with a high max depth.

n_estimators (300):

This parameter specifies the number of trees in the forest. Setting n_estimators to 300 means that the Random Forest model will consist of 300 individual decision trees. Generally, more trees lead to better performance and more stable predictions, but they also increase computation time.
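Putting the five values together, the tuned configuration can be rebuilt directly without rerunning the grid search; a sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# The best configuration reported by the grid search above,
# instantiated directly so the (expensive) search need not be rerun
tuned_rf = RandomForestClassifier(
    bootstrap=False,       # each tree is built on the full dataset
    max_depth=15,          # trees capped at 15 levels deep
    min_samples_leaf=1,    # leaves may hold a single sample
    min_samples_split=5,   # nodes need at least 5 samples to split
    n_estimators=300,      # 300 trees in the forest
    random_state=111,
)
print(tuned_rf.get_params()["n_estimators"])  # → 300
```

Calling `tuned_rf.fit(X_train, Y_train)` then reproduces the tuned model from the output above.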

The classification report provides detailed metrics for evaluating the model's performance on the test set. The precision for predicting non-churning customers (label 0) is approximately 0.87, while for churning customers (label 1), it is 0.83. This indicates that the model is fairly accurate in identifying non-churning customers while also being competent in predicting churn. The recall scores suggest that the model is able to identify about 83% of actual non-churning customers and 87% of actual churning customers, demonstrating a strong ability to capture both classes. The F1 scores, which balance precision and recall, are around 0.85 for both classes, indicating that the model performs equally well across them.

The overall accuracy of the model is 84.70%, suggesting that the model correctly predicts the churn status for nearly 85% of the test instances. The weighted average scores for precision, recall, and F1 score all align closely with the overall accuracy, indicating consistent performance across the majority class and the minority class.

Finally, the cross-validation score, calculated at 85.16%, reflects the model's performance across multiple subsets of the data, ensuring robustness and reliability in its predictive capabilities. The negligible difference of -0.46 between the accuracy score and the cross-validation score suggests that the model is not overfitting, as it performs consistently across different data splits. This outcome emphasizes the effectiveness of the Random Forest model in predicting customer churn, as it balances complexity and generalization effectively.

The performance metrics of the Random Forest Classifier before and after model optimization indicate that the overall performance has remained relatively consistent, with slight variations in the scores. Here's a breakdown of the observations and possible reasons for these changes:

The following changes can be observed between the Random Forest Classifier before and after optimization.

Stability in Performance Metrics:

The precision, recall, and F1 scores remain very similar (around 84-85%), indicating that the model's ability to classify positive and negative classes hasn't significantly changed. The accuracy score decreased slightly after optimization, from 85.18% to 84.70%. This could be due to the grid search process finding hyperparameters that lead to a more balanced model, potentially improving generalization at the cost of minor accuracy drops.

Cross-Validation Score:

The cross-validation score also increased slightly, from 84.89% to 85.16%. This suggests that the optimized model generalizes slightly better across data subsets than the untuned one, even though its test-set accuracy dipped marginally.

Hyperparameter Impact:

Hyperparameter tuning is designed to enhance model performance and generalization. However, it often leads to subtle changes in performance metrics, which is a common occurrence in machine learning. The optimal parameters may not yield drastic improvements. The optimization process may have favored hyperparameters that balance precision and recall rather than maximizing accuracy alone, resulting in slight adjustments to the reported scores.

I have used the below code to further optimise the Random Forest Model.

In [ ]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Define a function to classify and evaluate the model
def classify_with_tuning(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Hyperparameter tuning using Grid Search
    param_grid = {
        'n_estimators': [100, 200, 300, 400, 500],  # Increased number of estimators
        'max_depth': [10, 15, 20, None],  # Allowing unlimited depth
        'min_samples_split': [2, 5, 10, 15],  # Added more options
        'min_samples_leaf': [1, 2, 4, 6],  # Added more options
        'bootstrap': [True, False],
        'class_weight': [None, 'balanced']  # Adjusting class weights
    }

    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

    # Fit the grid search model
    grid_search.fit(X_train, Y_train)

    # Best parameters after tuning
    print("Best Hyperparameters after tuning:\n", grid_search.best_params_)

    # Train the model with the best parameters
    best_model = grid_search.best_estimator_

    # Predict Y_test
    pred = best_model.predict(X_test)

    # Classification Report
    class_report = classification_report(Y_test, pred)
    print("\nClassification Report:\n", class_report)

    # Accuracy Score
    acc_score = accuracy_score(Y_test, pred) * 100
    print("Accuracy Score:", acc_score)

    # Precision, Recall, F1 Score
    precision = precision_score(Y_test, pred, average='weighted') * 100
    recall = recall_score(Y_test, pred, average='weighted') * 100
    f1 = f1_score(Y_test, pred, average='weighted') * 100

    print("Precision Score:", precision)
    print("Recall Score:", recall)
    print("F1 Score:", f1)

    # Cross Validation Score
    cv_score = cross_val_score(best_model, X, Y, cv=5).mean() * 100
    print("Cross Validation Score:", cv_score)

    # Result of accuracy minus cv scores
    result = acc_score - cv_score
    print("\nAccuracy Score - Cross Validation Score is", result)

# Initialize Random Forest Classifier
model = RandomForestClassifier(random_state=111)

# Call the function to classify and tune
classify_with_tuning(model, X, Y)
Fitting 5 folds for each of 1280 candidates, totalling 6400 fits
Best Hyperparameters after tuning:
 {'bootstrap': False, 'class_weight': None, 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.83      0.85      1581
           1       0.83      0.87      0.85      1524

    accuracy                           0.85      3105
   macro avg       0.85      0.85      0.85      3105
weighted avg       0.85      0.85      0.85      3105

Accuracy Score: 84.66988727858293
Precision Score: 84.7445527708363
Recall Score: 84.66988727858293
F1 Score: 84.66894275624607
Cross Validation Score: 85.09963271948689

Accuracy Score - Cross Validation Score is -0.4297454409039574

In the second implementation of the hyperparameter tuning and cross-validation process for the Random Forest model, I made several adjustments to enhance the model's performance. I expanded the hyperparameter grid by increasing the range of n_estimators to include values from 100 to 500, allowing for a greater number of trees in the ensemble, which can improve model accuracy. Additionally, I included None as an option for max_depth, permitting the model to grow trees without restrictions, which can be beneficial in capturing complex patterns in the data. I also added more options for min_samples_split and min_samples_leaf, incorporating values of 15 and 6, respectively, to explore a wider variety of tree growth configurations. Furthermore, I introduced the class_weight parameter, allowing the model to adjust the weights of classes to address any potential class imbalance in the dataset. These enhancements aim to optimize the model's ability to generalize and improve its predictive performance, especially after observing a decrease in accuracy following the initial optimization.

Here we can see that the accuracy has reduced slightly further: before hyperparameter tuning the model's accuracy was 85.18%, whereas here it is 84.67%.

Hyperparameter Tuning and Cross-Validation for the Extra Trees Model

In [64]:
# Import the classifier (the search and metric helpers were imported in earlier cells)
from sklearn.ensemble import ExtraTreesClassifier

# Define the function for hyperparameter tuning and cross-validation
def classify_with_tuning(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Define the hyperparameter grid for tuning
    param_grid = {
        'n_estimators': [100, 200, 300],         # Number of trees
        'max_depth': [10, 15, 20],               # Maximum depth of each tree
        'min_samples_split': [2, 5, 10],         # Minimum number of samples to split
        'min_samples_leaf': [1, 2, 4],           # Minimum number of samples per leaf
        'bootstrap': [True, False]               # Use bootstrapping or not
    }

    # Use Grid Search to find the best parameters
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

    # Fit the model on training data
    grid_search.fit(X_train, Y_train)

    # Get the best parameters after tuning
    print("Best Hyperparameters after tuning:\n", grid_search.best_params_)

    # Use the best model found by Grid Search
    best_model = grid_search.best_estimator_

    # Predict on the test set
    pred = best_model.predict(X_test)

    # Classification Report
    print("\nClassification Report:\n", classification_report(Y_test, pred))

    # Accuracy Score
    acc_score = accuracy_score(Y_test, pred) * 100
    print("Accuracy Score:", acc_score)

    # Precision, Recall, F1 Score
    precision = precision_score(Y_test, pred, average='weighted') * 100
    recall = recall_score(Y_test, pred, average='weighted') * 100
    f1 = f1_score(Y_test, pred, average='weighted') * 100

    print("Precision Score:", precision)
    print("Recall Score:", recall)
    print("F1 Score:", f1)

    # Apply 5-fold Cross-Validation on the best model
    cv_scores = cross_val_score(best_model, X, Y, cv=5)
    print("Cross-Validation Scores:", cv_scores)
    print("Mean Cross-Validation Score:", cv_scores.mean() * 100)

# Initialize the Extra Trees Classifier
model = ExtraTreesClassifier(random_state=111)

# Call the function to classify with tuning and cross-validation
classify_with_tuning(model, X, Y)
Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best Hyperparameters after tuning:
 {'bootstrap': False, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.83      0.85      1581
           1       0.83      0.87      0.85      1524

    accuracy                           0.85      3105
   macro avg       0.85      0.85      0.85      3105
weighted avg       0.85      0.85      0.85      3105

Accuracy Score: 85.2818035426731
Precision Score: 85.37728166348363
Recall Score: 85.2818035426731
F1 Score: 85.27974253432762
Cross-Validation Scores: [0.80531401 0.82028986 0.89903382 0.90381827 0.90865152]
Mean Cross-Validation Score: 86.74214946659103

In this process, the Extra Trees Classifier model was first subjected to hyperparameter tuning using Grid Search to optimize key parameters such as the number of trees (n_estimators), tree depth (max_depth), minimum samples required to split a node (min_samples_split), minimum samples required to be a leaf (min_samples_leaf), and whether or not to use bootstrapping (bootstrap). A total of 162 different hyperparameter combinations were tested, and the best-performing combination was selected based on cross-validation. The best parameters identified were bootstrap: False, max_depth: 20, min_samples_leaf: 1, min_samples_split: 2, and n_estimators: 200.

After selecting the optimal hyperparameters, the model was evaluated on unseen test data, yielding an accuracy score of 85.28%, a precision score of 85.37%, a recall score of 85.28%, and an F1 score of 85.28%. These metrics indicate the model's balanced performance in predicting both classes (churn and non-churn).

Finally, 5-fold cross-validation was applied to assess the model’s robustness, producing cross-validation scores ranging from 80.5% to 90.86%. The mean cross-validation score was 86.74%, which suggests that the model generalizes well across different subsets of the data. The small difference between the accuracy score on the test data and the mean cross-validation score indicates a reliable and robust model after tuning.

We can see that the best model is the hyperparameter-tuned Extra Trees Classifier, which has the highest accuracy of 85.28%.¶

In [65]:
# Initialize Extra Trees Classifier with the best hyperparameters found earlier
best_model = ExtraTreesClassifier(
    bootstrap=False,
    max_depth=20,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=200,
    random_state=111
)

# Fit the best_model on your entire dataset
best_model.fit(X, Y)
Out[65]:
ExtraTreesClassifier(max_depth=20, n_estimators=200, random_state=111)

I initialize an ExtraTreesClassifier using the hyperparameters that were found to be optimal during the previous hyperparameter tuning process. These hyperparameters include setting bootstrap to False (indicating that each tree will be trained on the entire dataset, not on bootstrapped samples), max_depth of the trees to 20 (limiting the depth of the decision trees), and specific values for min_samples_leaf (minimum number of samples required to be at a leaf node) and min_samples_split (minimum number of samples required to split an internal node). Additionally, the model is set to use 200 trees (`n_estimators=200`).


In [66]:
# Define the function to visualize feature importance and cross-validation scores
def visualize_model_performance(best_model, X, Y):
    # Plot Feature Importance
    feature_importances = best_model.feature_importances_
    sorted_idx = np.argsort(feature_importances)[::-1]  # Sort in descending order of importance
    features = X.columns

    plt.figure(figsize=(10, 6))
    plt.title('Feature Importances of the Extra Trees Classifier')
    plt.barh(range(len(sorted_idx)), feature_importances[sorted_idx], align='center')
    plt.yticks(range(len(sorted_idx)), [features[i] for i in sorted_idx])
    plt.xlabel('Relative Importance')
    plt.show()

    # Cross-Validation Score Visualization
    cv_scores = cross_val_score(best_model, X, Y, cv=5)
    plt.figure(figsize=(8, 4))
    plt.plot(range(1, 6), cv_scores, marker='o', linestyle='-', color='b', label='CV Score')
    plt.title('Cross-Validation Scores')
    plt.xlabel('Fold')
    plt.ylabel('Score')
    plt.ylim([0.75, 1.0])  # Setting y-axis limits to visualize well
    plt.grid(True)
    plt.legend(loc='lower right')
    plt.show()

# Call the function to visualize the performance
visualize_model_performance(best_model, X, Y)

The code provided is used to visualize the feature importance of an Extra Trees Classifier and the performance of cross-validation. The function visualize_model_performance takes the best-tuned Extra Trees model (best_model) and the dataset (X, Y) as inputs.

  1. Feature Importance Plot: The model's feature importances are first calculated using the feature_importances_ attribute, which gives a relative ranking of how important each feature was in making decisions during the classification. The features are sorted in descending order, and a horizontal bar chart is plotted to visualize the importance of each feature. This allows us to see which features have the greatest influence on the classifier's decisions. In the output, Contract and tenure are the most important features, while PhoneService has the least influence.

  2. Cross-Validation Score Plot: The cross-validation scores are calculated using cross_val_score with 5-fold cross-validation, and the scores are plotted for each fold. This visualization provides insight into how well the model generalizes to unseen data. From the output, we can observe that the cross-validation score improves from fold 1 to fold 5, indicating a steady performance increase across the different splits.

Together, these plots give an overview of both the most important features for classification and the model's robustness through cross-validation.

In [72]:
# roc_curve and roc_auc_score are required for the ROC plot below
from sklearn.metrics import roc_curve, roc_auc_score

def classify(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Fit the model on the training data
    model.fit(X_train, Y_train)

    # Predict the probabilities for the positive class (1)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Compute ROC curve and AUC score
    fpr, tpr, thresholds = roc_curve(Y_test, y_prob)
    roc_auc = roc_auc_score(Y_test, y_prob)

    # Plot the ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='blue', label=f'AUC = {roc_auc:.2f}')
    plt.plot([0, 1], [0, 1], color='red', linestyle='--')  # Diagonal line
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.grid()
    plt.show()



classify(best_model, X, Y)

The AUC-ROC curve is a key performance evaluation tool for binary classification models. It plots the True Positive Rate (TPR), also known as sensitivity or recall, against the False Positive Rate (FPR) at various classification thresholds. The True Positive Rate measures the proportion of actual positive instances that are correctly classified by the model, while the False Positive Rate measures the proportion of negative instances incorrectly classified as positive. The curve helps visualize how well the model distinguishes between the two classes (positive and negative) across different thresholds.

The AUC (Area Under the ROC Curve) quantifies the overall performance of the classifier. AUC values range from 0 to 1, where an AUC of 1 represents a perfect classifier, meaning it can perfectly distinguish between the two classes. An AUC of 0.5 represents a model with no discriminative ability, equivalent to random guessing. Values below 0.5 indicate a model performing worse than random guessing. Therefore, the closer the AUC is to 1, the better the model is at classifying the instances.

In the ROC curve plotted above, the blue curve represents the performance of the model. The red dashed line serves as a baseline, showing the performance of a random classifier (AUC = 0.5), where the True Positive Rate is equal to the False Positive Rate. The model in this case has an AUC of 0.93, which indicates a strong performance. This score means that the classifier has a 93% chance of ranking a randomly chosen positive instance higher than a randomly chosen negative one. In other words, the model is highly effective in distinguishing between the positive and negative classes.
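The ranking interpretation of AUC can be checked directly on a tiny toy example: the fraction of positive/negative pairs in which the positive instance receives the higher score equals the value returned by roc_auc_score (the labels and probabilities below are illustrative, not taken from the churn model).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

auc = roc_auc_score(y_true, y_prob)

# Count the positive/negative pairs where the positive is scored higher
# (ties count as half a correctly ordered pair)
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]

print(auc, sum(pairs) / len(pairs))  # both equal 8/9 ≈ 0.889
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9; an AUC of 0.93 means the same pairwise-ranking fraction is 93%.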

In [73]:
def plot_confusion_matrix(model, X, Y):
    # Split the data into training (70%) and testing (30%) sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=10)

    # Fit the model on the training data
    model.fit(X_train, Y_train)

    # Predict the class labels for the test set
    Y_pred = model.predict(X_test)

    # Generate the confusion matrix
    cm = confusion_matrix(Y_test, Y_pred)

    # Plot the confusion matrix using a heatmap for better visualization
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False,
                xticklabels=["Class 0", "Class 1"], yticklabels=["Class 0", "Class 1"])
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()

# Example usage
plot_confusion_matrix(best_model, X, Y)

A confusion matrix is a performance evaluation tool used for classification models. It provides a detailed breakdown of the model's predictions by comparing actual class labels with predicted class labels. The matrix is organized into four key areas: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

1315 True Negatives: These represent the cases where the model correctly classified Class 0 (negative class).

1333 True Positives: These are the instances where the model correctly predicted Class 1 (positive class).

266 False Positives: These represent the cases where the model mistakenly predicted Class 1 (positive) while the true label was Class 0.

191 False Negatives: These are cases where the model incorrectly classified the instances as Class 0 (negative) while they were actually Class 1 (positive).

The confusion matrix indicates that the model has a fairly strong performance in both classes, with higher values in the True Positives and True Negatives, showing that the majority of the predictions are correct. However, some errors (False Positives and False Negatives) still exist, which is typical in any classification task.
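As a quick sanity check, the standard summary metrics can be derived directly from the four counts reported above; this is a minimal sketch using those numbers rather than a re-run of the model:

```python
# Confusion-matrix counts reported above
TN, TP, FP, FN = 1315, 1333, 266, 191

# Standard metrics derived from the counts
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of all predictions that are correct
precision = TP / (TP + FP)                    # of predicted churners, how many actually churned
recall    = TP / (TP + FN)                    # of actual churners, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # ≈ 0.853
print(f"Precision: {precision:.3f}")  # ≈ 0.834
print(f"Recall:    {recall:.3f}")     # ≈ 0.875
print(f"F1 score:  {f1:.3f}")         # ≈ 0.854
```

These values are consistent with the matrix's overall impression of a fairly strong classifier on both classes.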

06. Insights and Business Recommendations¶

Based on the insights gained from the best-performing Extra Tree Classifier model and the EDA performed during the data mining process, several strategic business recommendations can be made to improve customer retention and overall decision-making within the organization.

1. Targeted Retention Strategies

The predictive model indicates a churn rate of roughly 26.5%, implying that a sizable proportion of customers are expected to cancel their services. To address this, the company should develop focused retention measures for at-risk customers. Using the model's predictions, the company can identify customers with characteristics linked to higher churn risk, such as higher monthly charges or shorter tenure. Tailored outreach strategies, including targeted discounts, loyalty programs, and proactive customer care, can then be implemented to engage these customers and reduce churn.

2. Price Strategy Adjustments

The analysis reveals a possible link between higher monthly charges and customer attrition. It is therefore important to assess the pricing structure and consider tiered pricing or promotional offers for high-risk segments. Adjusting prices to match customer budgets while retaining profitability can improve customer satisfaction and retention. For example, offering limited-time discounts or bundling services at a lower cost could encourage customers to stay with the business.

3. Enhancing Customer Experience

Customers in their first few months of service are especially prone to churn. As a result, it is crucial to improve the onboarding process and deliver a great customer experience during this critical phase. Effective onboarding programs, clear communication, and personalized support can foster a sense of value and commitment among new customers, reducing the likelihood of early churn.

4. Concentrate on Senior Citizens

With a small proportion of senior citizens among the customer base, there is an opportunity to tailor services to their specific requirements. The company can engage this group and build a loyal customer base by implementing targeted marketing techniques and service bundles aimed specifically at senior customers. These could include simpler service options, personalized communication, and dedicated customer care channels.

5. Continuous Monitoring and Adjustment

Finally, the firm should continuously monitor customer data and model performance. Regularly retraining the predictive model on new data will help identify emerging trends and shifting customer habits.

Implementing these guidelines will help the firm enhance customer retention rates, boost profitability, and build long-term client loyalty.